4  Lab Part

School of Accounting, Finance and Economics

BECS2002 / Econometrics and Data Analytics

Lab Handbook

Week 11

Module coordinator: Dr Camilo Calderon

Email: cam.calderon@dmu.ac.uk

This version: 07/01/2024

This handbook has been produced to provide students with specific information and guidance about their labs. The contents of this handbook are indicative; this means they can be updated or modified following review by the module leader (Cam).

An electronic version of this handbook (which is continuously updated) is available on the VLE system, Blackboard, which you should consult regularly as the main reference point throughout your studies.

Indicative Contents

Lecture 1 – Regression Models I: trend and seasonality 4

Lecture 2 – Regression models II: trend and seasonality, CH6 9

Video 1: Lab 14 – Inflation and Models with Trend and Seasonality 13

Video 2: Lab 17 Confusion Matrix, Melbourne rainfall forecast, what are the odds of an event happening? 21

Lab 12: Linear Regression and autocorrelation 23

Lab 13: Linear regression and evaluating predictability 30

Lab 14: Regression and external information 41

Lab 15: Binary forecast and logistic regression 45

Lab 16: Neural Networks I 50

Lab 17: Neural Networks II 58

PRACTICE LABS 59

Lab 13 – models with trend and seasonality 59

Lab 15 – Foreign Exchange Rate Market 61

Lab 16 – Money Market and Bond Yield Polls; Why are inverted yield curves important? 66

Lab 18 – Neural Networks 70

Supplementary Lab 1 74

Supplementary Lab – Social, Macro Overview and Signal 76

Supplementary Lab – Alternative investing and CBA 80

Supplementary Lab - Willingness to pay 86

Supplementary Lab – R Shiny 89

Lecture 1 – Regression Models I: trend and seasonality

Example used in this lab

Cam Calderon

2023-12-09

R Markdown

5 REGRESSION MODELS - TREND AND SEASONALITY - 09-12-2023

6 Required libraries

library(readxl)
# Uncomment the line below to install the 'forecast' package if it's not already installed
# install.packages("forecast")
library(forecast)

6.1 Registered S3 method overwritten by ‘quantmod’:

6.2 method from

6.3 as.zoo.data.frame zoo

7 Loading FDI data from the Office for National Statistics (ONS)

ONS_fdi <- read_excel("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/fdi.xls", skip = 7)
View(ONS_fdi)

8 Converting the data into a time series object

fdi.ts <- ts(ONS_fdi$fdi, start = c(1987,1), end = c(2023, 2), freq = 4)

9 Plotting the time series data

plot(fdi.ts, xlab = "Year", ylab = "FDI Inward", main = "FDI Inward Flow")

10 Partitioning the dataset into training and validation sets

stepsAhead <- 40
nTrain <- length(fdi.ts) - stepsAhead
train.ts <- window(fdi.ts, start = c(1987, 1), end = c(1987, nTrain))
valid.ts <- window(fdi.ts, start = c(1987, nTrain + 1), end = c(1987, nTrain + stepsAhead))
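To see what this partition does, here is a minimal base-R sketch on toy quarterly data (synthetic values, not the ONS FDI series): window() slices a ts object by (year, quarter) coordinates, and R accepts a quarter index beyond 4, counting forward from the start year.

```r
# Toy partition: 60 quarterly observations from 1987 Q1 (synthetic data)
toy.ts <- ts(1:60, start = c(1987, 1), frequency = 4)
steps.toy <- 8                               # hold out the last 8 quarters
n.toy <- length(toy.ts) - steps.toy          # 52 training observations
train.toy <- window(toy.ts, end = c(1987, n.toy))        # quarters 1..52
valid.toy <- window(toy.ts, start = c(1987, n.toy + 1))  # quarters 53..60
length(train.toy)  # 52
length(valid.toy)  # 8
```

Here c(1987, 53) is read as "the 53rd quarter counting from 1987 Q1", i.e. 2000 Q1, which is why the same start year appears in both window() calls above.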

11 Fitting a linear model with additive seasonality

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(train.lm.trend.season)

11.1

11.2 Call:

11.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

11.4

11.5 Residuals:

11.6 Min 1Q Median 3Q Max

11.7 -69865 -16822 6605 18803 42921

11.8

11.9 Coefficients:

11.10 Estimate Std. Error t value Pr(>|t|)

11.11 (Intercept) 86271.361 9776.832 8.824 3.72e-14 ***

11.12 trend -2840.422 371.997 -7.636 1.37e-11 ***

11.13 I(trend^2) 84.977 3.368 25.228 < 2e-16 ***

11.14 season2 197.592 7896.358 0.025 0.980

11.15 season3 3015.444 7974.934 0.378 0.706

11.16 season4 2322.794 7975.466 0.291 0.771

11.17

11.18 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

11.19

11.20 Residual standard error: 29010 on 100 degrees of freedom

11.21 Multiple R-squared: 0.9813, Adjusted R-squared: 0.9804

11.22 F-statistic: 1049 on 5 and 100 DF, p-value: < 2.2e-16

12 Forecasting using the linear model

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead)

13 Plotting the forecast results

plot(train.ts, ylim = c(0, 2000000), ylab = "FDI Inward (millions of pounds)", xlab = "Year", bty = "l", xaxt = "n", xlim = c(1987, 2024), main = "Total FDI, Equity Capital & Reinvested Earnings", lty = 1)
axis(1, at = seq(1987, 2024, 2), labels = format(seq(1987, 2024, 2)))
lines(valid.ts, lty = 2, lwd = 2)
lines(train.lm.t.s.pred$mean, lty = 2, lwd = 2, col = "blue")

14 Assessing forecast accuracy

fcast <- forecast(train.lm.trend.season, h = 40)
accuracy(fcast$mean, valid.ts)

14.1 ME RMSE MAE MPE MAPE ACF1 Theil’s U

14.2 Test set 258355.9 356540.8 267170.7 14.88078 15.92279 0.9231679 4.896507

15 Naive forecast for the training period

snaive4cast <- snaive(train.ts, h = 40)
naive <- forecast(snaive4cast, h = 40)
accuracy(snaive4cast, valid.ts)

15.1 ME RMSE MAE MPE MAPE MASE ACF1

15.2 Training set 28033.81 44826.92 30468.36 10.31471 11.32815 1.00000 0.8273804

15.3 Test set 601469.28 755172.46 601525.57 37.40444 37.41183 19.74263 0.9297684

15.4 Theil’s U

15.5 Training set NA

15.6 Test set 10.77091

16 Plotting the naive forecast

lines(naive$mean, lty = 2, lwd = 2, col = "red")

17 Using ETS models for forecasting

fdi.ets.aan <- ets(train.ts, model = "AAN")
fdi.ets.ann <- ets(train.ts, model = "ANN", damped = FALSE)
fdi.ets.zzz <- ets(train.ts, model = "ZZZ", damped = TRUE)

fdi.ets.aan.pred <- forecast(fdi.ets.aan, h = 40, level = c(0.2, 0.4, 0.6, 0.8))
fdi.ets.ann.pred <- forecast(fdi.ets.ann, h = 40, level = c(0.2, 0.4, 0.6, 0.8))
fdi.ets.zzz.pred <- forecast(fdi.ets.zzz, h = 40, level = c(0.2, 0.4, 0.6, 0.8))

18 Plotting ETS model forecasts

lines(fdi.ets.aan.pred$mean, lty = 2, col = "grey", lwd = 2)
lines(fdi.ets.ann.pred$mean, lty = 2, col = "green", lwd = 2)
lines(fdi.ets.zzz.pred$mean, lty = 2, col = "orange", lwd = 2)

19 Evaluating ETS model accuracy

accuracy(fdi.ets.aan.pred$mean,valid.ts)

19.1 ME RMSE MAE MPE MAPE ACF1 Theil’s U

19.2 Test set 289297.9 413346.4 308538.1 16.11673 18.44179 0.9251353 5.634319

accuracy(fdi.ets.ann.pred$mean,valid.ts)

19.3 ME RMSE MAE MPE MAPE ACF1 Theil’s U

19.4 Test set 584104.6 742120.7 585794.8 35.93561 36.15939 0.927391 10.52316

accuracy(fdi.ets.zzz.pred$mean,valid.ts)

19.5 ME RMSE MAE MPE MAPE ACF1 Theil’s U

19.6 Test set 340110.2 483237.1 357825 19.0682 21.23053 0.9260083 6.566403

20 Adding a legend

legend("topleft",  # position of the legend
       legend = c("Validation", "Linear Model", "Naive Forecast", "ETS AAN", "ETS ANN", "ETS ZZZ"),  # text in the legend
       col = c("black", "blue", "red", "grey", "green", "orange"),  # colors of the lines
       lty = 2,  # line types
       lwd = 2)  # line widths

Lecture 2 – Regression models II: trend and seasonality, CH6

Example used in this lab

Cam Calderon

2023-12-09

R Markdown

21 REGRESSION MODELS - TREND AND SEASONALITY - 09-12-2023

22 Load necessary libraries

library(readxl)
# Uncomment the next line to install the 'forecast' package if not already installed
# install.packages("forecast")
library(forecast)

22.1 Registered S3 method overwritten by ‘quantmod’:

22.2 method from

22.3 as.zoo.data.frame zoo

23 Load retail sales index data from ONS

ONS_retail <- read_excel("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/rsi 2023.xlsx", skip = 10)

#View(ONS_retail)

24 Convert retail sales data into a time series object

retail.ts <- ts(ONS_retail$rsi, start = c(1996,1), end = c(2023, 10), freq = 12)

25 Initial time series plot for retail sales index

plot(retail.ts, xlab = "Year", ylab = "Retail Sales Index - J5AH", main = "Value of Retail Sales at Current Prices Non-SA")

26 Split the dataset into training and validation sets

stepsAhead <- 12
nTrain <- length(retail.ts) - stepsAhead
train.ts <- window(retail.ts, start = c(1996, 1), end = c(1996, nTrain))
valid.ts <- window(retail.ts, start = c(1996, nTrain + 1), end = c(1996, nTrain + stepsAhead))

27 Fit a time series linear model with trend and seasonality

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(train.lm.trend.season)

27.1

27.2 Call:

27.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

27.4

27.5 Residuals:

27.6 Min 1Q Median 3Q Max

27.7 -23.9822 -1.0645 0.0657 1.3221 10.6067

27.8

27.9 Coefficients:

27.10 Estimate Std. Error t value Pr(>|t|)

27.11 (Intercept) 3.962e+01 7.251e-01 54.649 < 2e-16 ***

27.12 trend 1.250e-01 7.034e-03 17.768 < 2e-16 ***

27.13 I(trend^2) 2.078e-04 2.109e-05 9.853 < 2e-16 ***

27.14 season2 1.139e+00 7.956e-01 1.432 0.153

27.15 season3 3.248e+00 7.956e-01 4.083 5.67e-05 ***

27.16 season4 4.450e+00 7.956e-01 5.593 4.93e-08 ***

27.17 season5 5.321e+00 7.956e-01 6.688 1.06e-10 ***

27.18 season6 5.866e+00 7.956e-01 7.373 1.55e-12 ***

27.19 season7 6.718e+00 7.956e-01 8.444 1.23e-15 ***

27.20 season8 4.521e+00 7.957e-01 5.682 3.08e-08 ***

27.21 season9 4.476e+00 7.957e-01 5.625 4.16e-08 ***

27.22 season10 7.593e+00 7.957e-01 9.543 < 2e-16 ***

27.23 season11 1.431e+01 8.033e-01 17.808 < 2e-16 ***

27.24 season12 2.508e+01 8.033e-01 31.223 < 2e-16 ***

27.25

27.26 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

27.27

27.28 Residual standard error: 2.923 on 308 degrees of freedom

27.29 Multiple R-squared: 0.978, Adjusted R-squared: 0.9771

27.30 F-statistic: 1054 on 13 and 308 DF, p-value: < 2.2e-16

28 Forecast future values using the fitted model

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead)

29 Plot the forecast and compare with actual values

plot(train.lm.t.s.pred, ylim = c(40, 130), ylab = "Retail Sales Index - J5AH", xlab = "Year", bty = "l", xaxt = "n", xlim = c(1996, 2024), main = "Forecast vs Actual: Retail Sales Index", lty = 2)

30 Customizing the x-axis with labels at two-year intervals

axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))

31 Adding the fitted values from the model to the plot

lines(train.lm.t.s.pred$fitted, lwd = 1, col = "blue")

32 Adding the validation set to the plot for comparison

lines(valid.ts)

33 Evaluate the forecast’s accuracy against the validation set

accuracy(train.lm.t.s.pred$mean, as.numeric(valid.ts))

33.1 ME RMSE MAE MPE MAPE

33.2 Test set 5.237359 5.945129 5.237359 4.419619 4.419619
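The measures that accuracy() reports can be reproduced by hand. A quick base-R sketch with made-up numbers (not the retail data), where an error is actual minus forecast:

```r
# Reproducing the accuracy() error measures on toy numbers
actual   <- c(100, 110, 120, 130)
forecast <- c( 98, 107, 115, 124)
e    <- actual - forecast          # forecast errors
ME   <- mean(e)                    # mean error (signed: shows bias)
RMSE <- sqrt(mean(e^2))            # root mean squared error
MAE  <- mean(abs(e))               # mean absolute error
MPE  <- mean(e / actual) * 100     # mean percentage error
MAPE <- mean(abs(e / actual)) * 100  # mean absolute percentage error
round(c(ME = ME, RMSE = RMSE, MAE = MAE, MPE = MPE, MAPE = MAPE), 3)
```

Note that when ME equals MAE (as in the test set above), every error has the same sign, i.e. the model systematically under- or over-forecasts.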

Video 1: Lab 14 – Inflation and Models with Trend and Seasonality

Cam Calderon

2023-12-11

Activity 1

Why Printing Trillions of Dollars May Not Cause Inflation?

Task 1. Review the following video and answer the question:

Activity 2

Forecast inflation. Will inflation increase beyond the target? If so, for how long?

Task 1. Download a dataset for inflation and run the code line by line; you can always ask your tutor if you do not understand something.

# install.packages("devtools")
library(devtools)

33.3 Loading required package: usethis

# install_github("ahmedmohamedali/eikonapir", force = TRUE)
library(eikonapir)  # opens the package to connect to Refinitiv
eikonapir::set_proxy_port(9000L)  # opens a port to Refinitiv
eikonapir::set_app_id('use APPKEY in Refinitiv and paste the API key here')  # Remember to keep your Refinitiv APP running before using your API key.

34 GBHICM=ECI is the name of the inflation time series.

inflation <- get_timeseries("GBHICM=ECI", start_date = "2009-01-01T01:00:00", end_date = paste0(Sys.Date(), "T01:00:00"), raw_output = FALSE, interval = "monthly")

35 The str() function shows the structure and format of 'inflation'

str(inflation)

35.1 ‘data.frame’: 178 obs. of 3 variables:

35.2 $ TIMESTAMP: chr “2009-01-31T00:00:00Z” “2009-02-28T00:00:00Z” “2009-03-31T00:00:00Z” “2009-04-30T00:00:00Z” …

35.3 $ VALUE : chr “-0.7” “0.9” “0.2” “0.2” …

35.4 $ NA : chr “GBHICM=ECI” “GBHICM=ECI” “GBHICM=ECI” “GBHICM=ECI” …

inflation.n <- as.numeric(inflation$VALUE)  # convert the data to numeric
inflation.df <- as.data.frame(inflation.n)  # make a data frame
View(inflation)

Task 2. Open the appropriate libraries to forecast and declare the data a time series.

# install.packages("forecast")
library(forecast)

35.5 Registered S3 method overwritten by ‘quantmod’:

35.6 method from

35.7 as.zoo.data.frame zoo

# install.packages("zoo")
library(zoo)

35.8

35.9 Attaching package: ‘zoo’

35.10 The following objects are masked from ‘package:base’:

35.11

35.12 as.Date, as.Date.numeric

36 set as a time series and plot the data

inflation.ts <- ts(inflation.df, start = c(2009, 1), end = c(2023, 10), freq = 12)
plot(inflation.ts)

Task 3. Create the appropriate training and validation partitions to forecast out-of-sample (OOS)

37 This section contains the partition of the dataset into training and validation period.

stepsAhead <- 12
nTrain <- length(inflation.ts) - stepsAhead
train.ts <- window(inflation.ts, start = c(2009, 1), end = c(2009, nTrain))
valid.ts <- window(inflation.ts, start = c(2009, nTrain + 1), end = c(2009, nTrain + stepsAhead))

Task 4. Fit a model with a quadratic trend and seasonality; afterwards, predict 12 months ahead using the fitted model and plot your results.

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(train.lm.trend.season)

37.1

37.2 Call:

37.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

37.4

37.5 Residuals:

37.6 Min 1Q Median 3Q Max

37.7 -0.9395 -0.1636 0.0185 0.1481 1.6026

37.8

37.9 Coefficients:

37.10 Estimate Std. Error t value Pr(>|t|)

37.11 (Intercept) -2.215e-01 1.032e-01 -2.147 0.0334 *

37.12 trend -1.098e-02 1.958e-03 -5.608 9.41e-08 ***

37.13 I(trend^2) 7.281e-05 1.136e-05 6.411 1.73e-09 ***

37.14 season2 9.780e-01 1.134e-01 8.625 7.90e-15 ***

37.15 season3 7.915e-01 1.134e-01 6.980 8.55e-11 ***

37.16 season4 1.012e+00 1.134e-01 8.925 1.35e-15 ***

37.17 season5 7.253e-01 1.134e-01 6.396 1.87e-09 ***

37.18 season6 5.670e-01 1.134e-01 4.999 1.57e-06 ***

37.19 season7 4.728e-01 1.134e-01 4.168 5.13e-05 ***

37.20 season8 8.713e-01 1.134e-01 7.681 1.80e-12 ***

37.21 season9 6.840e-01 1.135e-01 6.029 1.20e-08 ***

37.22 season10 7.823e-01 1.135e-01 6.894 1.36e-10 ***

37.23 season11 6.692e-01 1.156e-01 5.788 3.95e-08 ***

37.24 season12 8.604e-01 1.156e-01 7.440 6.91e-12 ***

37.25

37.26 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

37.27

37.28 Residual standard error: 0.3 on 152 degrees of freedom

37.29 Multiple R-squared: 0.5306, Adjusted R-squared: 0.4905

37.30 F-statistic: 13.22 on 13 and 152 DF, p-value: < 2.2e-16

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead, level = 0)

plot(train.lm.t.s.pred, ylab = "Inflation", xlab = "Year", bty = "l", ylim = c(-0.9, 3), xaxt = "n", xlim = c(2009, 2024), main = "Monthly inflation: trend and seasonality forecast", flty = 2)

axis(1, at = seq(2009, 2024, 1), labels = format(seq(2009, 2024, 1)))
# lines(train.lm.t.s.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)

#lines(train.ts)

Task 5. Compute a measure of predictive accuracy.

accuracy(train.lm.t.s.pred$mean, as.numeric(valid.ts))

37.31 ME RMSE MAE MPE MAPE

37.32 Test set -0.377735 0.5203881 0.4194823 -Inf Inf
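The -Inf and Inf values for MPE and MAPE are not a bug in your code: percentage errors divide by the actual value, so a validation series that touches zero (as monthly inflation does) makes them blow up. A toy base-R sketch (made-up numbers, not the inflation data):

```r
# Why MPE/MAPE can be -Inf/Inf: percentage errors divide by the actual value,
# so any zero in the validation set produces an infinite percentage error.
actual   <- c(2, 0, -1)
forecast <- c(1.5, 0.4, -0.8)
e <- actual - forecast
MAPE <- mean(abs(e / actual)) * 100
is.infinite(MAPE)  # TRUE: division by the zero actual value
```

In such cases prefer scale-based measures (RMSE, MAE) or a scaled measure like MASE, which accuracy() also reports for some forecast objects.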

Task 6. Calculate inflation for two years ahead after the end of the time series, and plot the results.

inflation.lm.trend.season <- tslm(inflation.ts ~ trend + I(trend^2) + season)
summary(inflation.lm.trend.season)

37.33

37.34 Call:

37.35 tslm(formula = inflation.ts ~ trend + I(trend^2) + season)

37.36

37.37 Residuals:

37.38 Min 1Q Median 3Q Max

37.39 -0.8591 -0.1448 0.0089 0.1545 1.6979

37.40

37.41 Coefficients:

37.42 Estimate Std. Error t value Pr(>|t|)

37.43 (Intercept) -3.029e-01 1.038e-01 -2.919 0.004004 **

37.44 trend -8.086e-03 1.834e-03 -4.409 1.88e-05 ***

37.45 I(trend^2) 5.215e-05 9.925e-06 5.254 4.57e-07 ***

37.46 season2 1.026e+00 1.140e-01 8.998 5.49e-16 ***

37.47 season3 8.316e-01 1.140e-01 7.294 1.22e-11 ***

37.48 season4 1.064e+00 1.140e-01 9.331 < 2e-16 ***

37.49 season5 7.627e-01 1.140e-01 6.689 3.37e-10 ***

37.50 season6 5.748e-01 1.140e-01 5.041 1.22e-06 ***

37.51 season7 4.535e-01 1.140e-01 3.976 0.000105 ***

37.52 season8 8.720e-01 1.141e-01 7.646 1.66e-12 ***

37.53 season9 7.104e-01 1.141e-01 6.228 3.81e-09 ***

37.54 season10 7.688e-01 1.141e-01 6.739 2.58e-10 ***

37.55 season11 6.875e-01 1.161e-01 5.922 1.81e-08 ***

37.56 season12 8.649e-01 1.161e-01 7.449 5.09e-12 ***

37.57

37.58 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

37.59

37.60 Residual standard error: 0.3122 on 164 degrees of freedom

37.61 Multiple R-squared: 0.5102, Adjusted R-squared: 0.4714

37.62 F-statistic: 13.14 on 13 and 164 DF, p-value: < 2.2e-16

inflation.lm.t.s.pred <- forecast(inflation.lm.trend.season, h = stepsAhead*2, level = 0)

plot(inflation.lm.t.s.pred, ylab = "Inflation", xlab = "Year", bty = "l", ylim = c(-0.9, 3), xaxt = "n", xlim = c(2009, 2024), main = "Monthly inflation: trend and seasonality forecast", flty = 2)
axis(1, at = seq(2009, 2024, 1), labels = format(seq(2009, 2024, 1)))

Task 7. Calculate an additional 10 years of inflation and forecast the year when inflation will be above 2%. Plot your results.

inflation.lm.t.s.pred$mean

37.63 Jan Feb Mar Apr May Jun

37.64 2023

37.65 2024 -0.05816355 0.97851512 0.79519379 1.03853913 0.74855113 0.57189647

37.66 2025 0.07883470 1.11676487 0.93469504 1.17929188 0.89055539 0.71515222

37.67 Jul Aug Sep Oct Nov Dec

37.68 2023 0.60799418 0.79595186

37.69 2024 0.46190847 0.89192047 0.74193248 0.81194448 0.74248943 0.93169861

37.70 2025 0.60641573 1.03767923 0.88894274 0.96020624

ma.trailing <- rollsum(inflation.lm.t.s.pred$mean, k = 12, align = "right")
plot(ma.trailing)

ma.trailing

37.71 Jan Feb Mar Apr May Jun Jul

37.72 2024

37.73 2025 8.793424 8.931674 9.071175 9.211928 9.353932 9.497188 9.641695

37.74 Aug Sep Oct Nov Dec

37.75 2024 8.386184 8.520679 8.656426

37.76 2025 9.787454 9.934464 10.082726
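If zoo is unavailable, a trailing rolling sum can also be sketched in base R with stats::filter; unlike rollsum(), it keeps the first k - 1 positions as NA rather than dropping them:

```r
# A base-R equivalent of zoo::rollsum(x, k = 12, align = "right"):
# stats::filter with a window of twelve ones gives trailing 12-month sums.
x  <- ts(1:24, start = c(2024, 1), frequency = 12)   # toy monthly series
rs <- stats::filter(x, rep(1, 12), sides = 1)        # sides = 1: trailing window
rs[12]  # sum of x[1:12]  = 78
rs[24]  # sum of x[13:24] = 222
```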

38 10-year prediction

ma.trailing.10year <- tslm(ma.trailing ~ trend + I(trend^2) + season)
summary(ma.trailing.10year)

38.1

38.2 Call:

38.3 tslm(formula = ma.trailing ~ trend + I(trend^2) + season)

38.4

38.5 Residuals:

38.6 ALL 13 residuals are 0: no residual degrees of freedom!

38.7

38.8 Coefficients: (1 not defined because of singularities)

38.9 Estimate Std. Error t value Pr(>|t|)

38.10 (Intercept) 8.253e+00 NaN NaN NaN

38.11 trend 1.326e-01 NaN NaN NaN

38.12 I(trend^2) 6.258e-04 NaN NaN NaN

38.13 season2 -1.595e-15 NaN NaN NaN

38.14 season3 -1.140e-15 NaN NaN NaN

38.15 season4 -2.230e-15 NaN NaN NaN

38.16 season5 -3.046e-15 NaN NaN NaN

38.17 season6 -3.667e-15 NaN NaN NaN

38.18 season7 -3.988e-15 NaN NaN NaN

38.19 season8 -4.062e-15 NaN NaN NaN

38.20 season9 -2.229e-15 NaN NaN NaN

38.21 season10 -1.815e-15 NaN NaN NaN

38.22 season11 2.475e-15 NaN NaN NaN

38.23 season12 NA NA NA NA

38.24

38.25 Residual standard error: NaN on 0 degrees of freedom

38.26 Multiple R-squared: 1, Adjusted R-squared: NaN

38.27 F-statistic: NaN on 12 and 0 DF, p-value: NA

ma.trailing.10year.pred <- forecast(ma.trailing.10year, h = 120, level = 0)

38.28 Warning in predict.lm(predict_object, newdata = newdata, se.fit = TRUE, :

38.29 prediction from rank-deficient fit; attr(*, “non-estim”) has doubtful cases

38.30 Warning in qt((1 - level)/2, df): NaNs produced

plot(ma.trailing.10year.pred)
abline(h = 2, col = "red")

Activity 3. You have learnt how to forecast the future with linear regression; now use the dataset for your assessment (presentation).

Video 2: Lab 17 Confusion Matrix, Melbourne rainfall forecast, what are the odds of an event happening?

This week's lecture covered binary forecasts using logistic regression. This technique will help you forecast binary events, e.g. is your team going to win or not, is there going to be a recession again this year or not, is a certain share going to increase its price greatly or not, etc. The basic idea is to forecast the probability of a certain event happening. We will use 1 when the event occurs and 0 when it does not.
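The link between the linear score, the probability, and the odds in logistic regression can be sketched in a few lines of base R; the coefficients below are hypothetical, not the estimates from the Melbourne rainfall model:

```r
# Sketch: how a logistic model turns a linear score into a probability and odds
b0 <- -0.75; b1 <- 0.11       # hypothetical intercept and slope
lag1 <- 5                     # hypothetical predictor value (e.g. yesterday's rainfall, mm)
score <- b0 + b1 * lag1       # linear predictor
p    <- 1 / (1 + exp(-score)) # probability of the event (inverse logit)
odds <- p / (1 - p)           # odds; identical to exp(score)
all.equal(odds, exp(score))   # TRUE by construction
```

This is why logistic coefficients are read on the odds scale: each one-unit increase in the predictor multiplies the odds by exp(b1).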

Activity 1. Confusion Matrix

Task 1. The following table shows the typical output of a confusion matrix in R. Suppose a 2x2 table with the label 'Reference' for your binary dependent variable (e.g. recession or no recession) and the label 'Predicted' for your forecast. The confusion matrix shows how many times the estimated model performs well (the green sections of the table); in contrast, the red sections count the mistakes of your prediction model/forecasts.

Read the documentation about confusionMatrix for R. Find and read the section 'Details' at the following link: . The link explains the output of the confusion matrix in R.

Task 2. Download the file 'MelbourneRainfall.xls' from the folder "Data Sets" on Blackboard into your datasets folder. Afterwards, run the following code:

39 Loading the ‘caret’ package for modeling and evaluation

library(caret)

40 Reading the dataset and formatting the date

rain.df <- read.csv("MelbourneRainfall.csv")

rain.df$Date <- as.Date(rain.df$Date, format = "%d/%m/%Y")

41 Creating a binary variable ‘Rainy’ based on the rainfall amount

rain.df$Rainy <- ifelse(rain.df$Rainfall > 0, 1, 0)

42 Preparing variables for logistic regression

nPeriods <- length(rain.df$Rainy)

rain.df$Lag1 <- c(NA, rain.df$Rainfall[1:(nPeriods - 1)]) # Lagged variable

rain.df$t <- seq(1, nPeriods, 1)

rain.df$Seasonal_sine <- sin(2 * pi * rain.df$t / 365.25) # Seasonal sine component

rain.df$Seasonal_cosine <- cos(2 * pi * rain.df$t / 365.25) # Seasonal cosine component

43 Splitting the dataset into training and validation sets

train.df <- rain.df[rain.df$Date <= as.Date("31/12/2009", format = "%d/%m/%Y"), ]

train.df <- train.df[-1,] # Removing the first row due to NA in Lag1

valid.df <- rain.df[rain.df$Date > as.Date("31/12/2009", format = "%d/%m/%Y"), ]

xvalid <- valid.df[, c(4,6,7)] # Validation predictors

44 Building the logistic regression model

rainy.lr <- glm(Rainy ~ Lag1 + Seasonal_sine + Seasonal_cosine, data = train.df, family = "binomial")

summary(rainy.lr) # Model summary

45 Prediction and evaluation using the confusion matrix

rainy.lr.pred <- predict(rainy.lr, xvalid, type = "response")

confusionMatrix(table(ifelse(rainy.lr$fitted > 0.5, 1, 0), train.df$Rainy)) # Training set

confusionMatrix(table(ifelse(rainy.lr.pred > 0.5, 1, 0), valid.df$Rainy)) # Validation set

Task 3. Review the confusion matrix and explain what is accuracy, sensitivity, and prevalence in the R outcome.

Call:

glm(formula = Rainy ~ Lag1 + Seasonal_sine + Seasonal_cosine,

family = "binomial", data = train.df)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.76888 0.03858 -19.927 < 2e-16 ***

Lag1 0.11187 0.01137 9.843 < 2e-16 ***

Seasonal_sine -0.26885 0.05049 -5.324 1.01e-07 ***

Seasonal_cosine -0.37134 0.05067 -7.328 2.33e-13 ***


Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 4751.8 on 3651 degrees of freedom

Residual deviance: 4533.7 on 3648 degrees of freedom

AIC: 4541.7

Number of Fisher Scoring iterations: 4

rainy.lr.pred <- predict(rainy.lr, xvalid, type = "response")

confusionMatrix(table(ifelse(rainy.lr$fitted > 0.5, 1, 0), train.df$Rainy))

Confusion Matrix and Statistics

0 1

0 2251 1115

1 104 182

Accuracy : 0.6662

95% CI : (0.6507, 0.6815)

No Information Rate : 0.6449

P-Value [Acc > NIR] : 0.003566

Kappa : 0.1166

Mcnemar’s Test P-Value : < 2.2e-16

Sensitivity : 0.9558

Specificity : 0.1403

Pos Pred Value : 0.6687

Neg Pred Value : 0.6364

Prevalence : 0.6449

Detection Rate : 0.6164

Detection Prevalence : 0.9217

Balanced Accuracy : 0.5481

‘Positive’ Class : 0

confusionMatrix(table(ifelse(rainy.lr.pred > 0.5, 1, 0), valid.df$Rainy))

Confusion Matrix and Statistics

0 1

0 373 220

1 21 55

Accuracy : 0.6398

95% CI : (0.6021, 0.6762)

No Information Rate : 0.5889

P-Value [Acc > NIR] : 0.004043

Kappa : 0.1647

Mcnemar’s Test P-Value : < 2.2e-16

Sensitivity : 0.9467

Specificity : 0.2000

Pos Pred Value : 0.6290

Neg Pred Value : 0.7237

Prevalence : 0.5889

Detection Rate : 0.5575

Detection Prevalence : 0.8864

Balanced Accuracy : 0.5734

‘Positive’ Class : 0

Confusion Matrix and Statistics Explanation

Confusion Matrix Structure

A (True Negative): Predicted no event, and no event occurred.

B (False Positive): Predicted an event, but no event occurred.

C (False Negative): Predicted no event, but the event occurred.

D (True Positive): Predicted an event, and the event occurred.

Key Statistics

Accuracy: Proportion of true results (both true positives and true negatives) among the total number of cases examined.

Formula: (A+D)/(A+B+C+D)

Indicates overall correctness of the model.

Sensitivity (Recall or True Positive Rate): Proportion of actual positives that are correctly identified.

Formula: D/(C+D)

Indicates the model’s ability to correctly predict positive events.

Specificity (True Negative Rate): Proportion of actual negatives that are correctly identified.

Formula: A/(A+B)

Indicates the model’s ability to correctly predict negative events.

Positive Predictive Value (Precision): Proportion of positive identifications that were actually correct.

Formula: D/(B+D)

Negative Predictive Value: Proportion of negative identifications that were actually correct.

Formula: A/(A+C)

Prevalence: Proportion of actual positives in the dataset.

Formula: (C+D)/(A+B+C+D)

Detection Rate: Proportion of the total sample that are true positives.

Formula: D/(A+B+C+D)

Detection Prevalence: Proportion of the total sample that are predicted positives.

Formula: (B+D)/(A+B+C+D)

Balanced Accuracy: Average of sensitivity and specificity.

Formula: (Sensitivity + Specificity) / 2

Useful in imbalanced datasets.

Kappa: Agreement between observed accuracy and expected accuracy (chance agreement).
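These formulas can be checked directly against the training-set matrix printed above. Note that caret reports 0 (no rain) as the 'Positive' class here, so the "positives" in the formulas are the no-rain days; rows are predictions and columns are the reference:

```r
# Recomputing the training-set statistics from the printed confusion matrix
pred0_ref0 <- 2251; pred0_ref1 <- 1115   # predicted 0 row
pred1_ref0 <-  104; pred1_ref1 <-  182   # predicted 1 row
n <- pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

accuracy    <- (pred0_ref0 + pred1_ref1) / n             # 0.6662
sensitivity <- pred0_ref0 / (pred0_ref0 + pred1_ref0)    # 0.9558
specificity <- pred1_ref1 / (pred0_ref1 + pred1_ref1)    # 0.1403
prevalence  <- (pred0_ref0 + pred1_ref0) / n             # 0.6449
round(c(accuracy, sensitivity, specificity, prevalence), 4)
```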

Interpretation of the Output

Training Set Confusion Matrix:

Accuracy: 66.62%

Sensitivity: 95.58%

Specificity: 14.03%

The model is better at predicting non-rainy days than rainy days.

Validation Set Confusion Matrix:

Accuracy: 63.98%

Sensitivity: 94.67%

Specificity: 20.00%

Similar performance as the training set, indicating consistency.

Conclusion

The model has high sensitivity but low specificity, indicating it’s more effective at predicting non-events (no rain) than events (rain). The accuracy is moderate, suggesting room for improvement, possibly by including more predictors or using different modeling techniques.

Lab 12: Linear Regression and autocorrelation

Example used in this lab

Cam Calderon

2023-12-09

R Markdown

46 Regression Models - Trend and Seasonality Analysis - 09-12-2023

47 Load necessary libraries

library(readxl)
# Uncomment the next line to install the 'forecast' package if not already installed
# install.packages("forecast")
library(forecast)

47.1 Registered S3 method overwritten by ‘quantmod’:

47.2 method from

47.3 as.zoo.data.frame zoo

48 Load the retail sales index data from the Office for National Statistics (ONS)

ONS_retail <- read_excel("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/rsi 2023.xlsx", skip = 10)

49 Uncomment the next line to view the loaded data

#View(ONS_retail)

50 Convert the retail sales data into a time series object

retail.ts <- ts(ONS_retail$rsi, start = c(1996,1), end = c(2023, 10), freq = 12)

51 Plot the initial time series for the retail sales index

plot(retail.ts, xlab = "Year", ylab = "Retail Sales Index - J5AH", main = "Value of Retail Sales at Current Prices Non-SA")

52 Split the data into training and validation sets

stepsAhead <- 12
nTrain <- length(retail.ts) - stepsAhead
train.ts <- window(retail.ts, start = c(1996, 1), end = c(1996, nTrain))
valid.ts <- window(retail.ts, start = c(1996, nTrain + 1), end = c(1996, nTrain + stepsAhead))

53 Fit a linear model to the training set with trend and seasonal components

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(train.lm.trend.season)

53.1

53.2 Call:

53.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

53.4

53.5 Residuals:

53.6 Min 1Q Median 3Q Max

53.7 -23.9822 -1.0645 0.0657 1.3221 10.6067

53.8

53.9 Coefficients:

53.10 Estimate Std. Error t value Pr(>|t|)

53.11 (Intercept) 3.962e+01 7.251e-01 54.649 < 2e-16 ***

53.12 trend 1.250e-01 7.034e-03 17.768 < 2e-16 ***

53.13 I(trend^2) 2.078e-04 2.109e-05 9.853 < 2e-16 ***

53.14 season2 1.139e+00 7.956e-01 1.432 0.153

53.15 season3 3.248e+00 7.956e-01 4.083 5.67e-05 ***

53.16 season4 4.450e+00 7.956e-01 5.593 4.93e-08 ***

53.17 season5 5.321e+00 7.956e-01 6.688 1.06e-10 ***

53.18 season6 5.866e+00 7.956e-01 7.373 1.55e-12 ***

53.19 season7 6.718e+00 7.956e-01 8.444 1.23e-15 ***

53.20 season8 4.521e+00 7.957e-01 5.682 3.08e-08 ***

53.21 season9 4.476e+00 7.957e-01 5.625 4.16e-08 ***

53.22 season10 7.593e+00 7.957e-01 9.543 < 2e-16 ***

53.23 season11 1.431e+01 8.033e-01 17.808 < 2e-16 ***

53.24 season12 2.508e+01 8.033e-01 31.223 < 2e-16 ***

53.25

53.26 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

53.27

53.28 Residual standard error: 2.923 on 308 degrees of freedom

53.29 Multiple R-squared: 0.978, Adjusted R-squared: 0.9771

53.30 F-statistic: 1054 on 13 and 308 DF, p-value: < 2.2e-16

54 Forecast future values using the fitted linear model

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead)

55 Plot the forecast and compare with actual values

plot(train.lm.t.s.pred, ylim = c(40, 130), ylab = "Retail Sales Index - J5AH", xlab = "Year", bty = "l", xaxt = "n", xlim = c(1996, 2024), main = "Forecast vs Actual: Retail Sales Index", lty = 2)

56 Customize the x-axis with labels at two-year intervals

axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))

57 Add the fitted values from the model to the plot

lines(train.lm.t.s.pred$fitted, lwd = 1, col = "blue")

58 Add the validation set to the plot for comparison

lines(valid.ts)

59 Evaluate the accuracy of the forecast against the validation set

accuracy(train.lm.t.s.pred$mean, as.numeric(valid.ts))

59.1 ME RMSE MAE MPE MAPE

59.2 Test set 5.237359 5.945129 5.237359 4.419619 4.419619

60 Analyze autocorrelation in the series

Acf(retail.ts, lag.max = 12, main = "Retail Sales - Autocorrelations at Lags 1-12")

61 Analyze autocorrelation in the residuals of the linear model

Acf(train.lm.trend.season$residuals, lag.max = 12, main = "Retail Sales - Errors & Autocorrelation")
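At lag 1, what Acf() plots is essentially the correlation between the series and its own one-period lag. A toy sketch on a synthetic monthly series (Acf() normalises slightly differently, so cor() is an approximation, but very close for long series):

```r
# What Acf() measures at lag 1, hand-rolled on a synthetic seasonal series
x  <- sin(2 * pi * (1:48) / 12)    # strongly seasonal toy series, period 12
r1 <- cor(x[-1], x[-length(x)])    # correlate x_t with x_{t-1}
r1 > 0.8                           # TRUE: adjacent months move together
```

Residuals from a good model should show no such pattern: the bars in the Acf plot should stay inside the confidence band.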

62 Fit an ARIMA model to the residuals of the linear model as a second layer

train.res.arima <- Arima(train.lm.trend.season$residuals, order = c(1,0,0))
train.res.arima.pred <- forecast(train.res.arima, h = stepsAhead)

63 Plot the new residuals

plot(train.lm.trend.season$residuals, ylim = c(-7, 7), ylab = "Residuals", xlab = "Year", bty = "l", xaxt = "n", xlim = c(1996, 2024), main = "Residuals after ARIMA Model")
axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))
lines(train.res.arima.pred$fitted, lwd = 2, col = "blue")

64 Analyze the autocorrelation in the residuals of the ARIMA model

Acf(train.res.arima$residuals, lag.max = 12, main = "Autocorrelation in Residuals of the ARIMA Model")

65 Final plot showing the residuals of the ARIMA model

plot(train.res.arima$residuals, ylim = c(-7, 7), ylab = "Residuals", xlab = "Year", bty = "l", xaxt = "n", xlim = c(1996, 2024), main = "")
axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))

# lines(train.res.arima.pred$fitted, lwd = 2, col = "blue")

Lab 13: Linear regression and evaluating predictability

Example used in this lab

Cam Calderon

2023-12-10

R

For a detailed description of ARIMA models see classic time series textbooks, such as Chapter 4 in C. Chatfield, The Analysis of Time Series: An Introduction, Chapman & Hall/CRC, 6th edition, 2003.

An Autoregressive Integrated Moving Average Model (ARIMA) is one that directly models the autocorrelation of the series values as well as the autocorrelations of the forecast errors.

To see the three components of an ARIMA model (AR, I, and MA), let’s start with the equations on the slides.

An AR(p) model captures the autocorrelations of the series values at lags 1, 2,…, p.

Next, we add autocorrelation terms for the forecast errors (called "Moving Average") up to lag q to get an ARMA(p, q) model.
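In standard textbook notation (matching the β and θ parameters used later in this section), the two models referred to above can be written as:

```latex
% AR(p): the series regressed on its own first p lags
y_t = c + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \epsilon_t

% ARMA(p, q): AR(p) plus moving-average terms in the last q forecast errors
y_t = c + \beta_1 y_{t-1} + \dots + \beta_p y_{t-p}
        + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}
```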

AR and ARMA models can only be fitted to data without trend or seasonality. Therefore, an ARIMA model incorporates a preliminary step of differencing, which removes trend. This differencing operation is the “I” (Integrated) in ARIMA. The order of differencing, denoted by parameter d, indicates how many rounds of lag-1 differencing are performed: d = 0 means no differencing, which is suitable if the series lacks a trend; d = 1 means differencing the series once (at lag-1), which can remove a linear trend; d = 2 means differencing the series twice, which can remove a quadratic trend.

Similarly, a seasonal ARIMA model incorporates a differencing step to remove seasonality and/or autocorrelation terms that capture remaining seasonal effects (see, e.g., people.duke.edu/~rnau/seasarim.htm). ARIMA models require the user to set the values of p, d and q; the software then estimates the β and θ parameters. Source: Shmueli, Galit and Lichtendahl Jr, Kenneth C., Practical Time Series Forecasting with R: A Hands-On Guide, 2nd edition, Axelrod Schnall Publishers, p. 152.
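The role of the d parameter is easy to verify numerically. The sketch below is a language-agnostic illustration written in Python (in R, the diff() function performs the same lag-1 differencing): one round of differencing turns a linear trend into a constant series, and two rounds do the same for a quadratic trend.

```python
# Lag-1 differencing: d rounds remove a polynomial trend of degree d.
def lag1_diff(series):
    """One round of lag-1 differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

linear = [3 + 2 * t for t in range(10)]   # series with a linear trend
quadratic = [t * t for t in range(10)]    # series with a quadratic trend

print(lag1_diff(linear))                  # constant series: the linear trend is gone (d = 1)
print(lag1_diff(lag1_diff(quadratic)))    # constant series after two rounds (d = 2)
```

Each round of differencing shortens the series by one observation, which is why heavily differenced models are estimated on slightly fewer data points.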

66 Regression Models - Trend and Seasonality Analysis - 09-12-2023

67 Loading necessary libraries

library(readxl) # For reading Excel files
# Uncomment below to install the 'forecast' package if it's not already installed
# install.packages("forecast")
library(forecast) # For time series forecasting

67.1 Registered S3 method overwritten by ‘quantmod’:

67.2 method from

67.3 as.zoo.data.frame zoo

68 Loading the retail sales index data from the Office for National Statistics (ONS)

69 Make sure the file path is correct and accessible

ONS_retail <- read_excel("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/rsi 2023.xlsx", skip = 10)

70 Uncomment below to view the loaded data

71 View(ONS_retail)

72 Converting the retail sales data into a time series object

73 Ensure that the start and end dates are correctly set

retail.ts <- ts(ONS_retail$rsi, start = c(1996,1), end = c(2023, 10), freq = 12)

74 Plotting the initial time series for the retail sales index

76 Splitting the data into training and validation sets

77 This step is crucial for model validation and avoiding overfitting

stepsAhead <- 12 # Number of steps to forecast
nTrain <- length(retail.ts) - stepsAhead # Training set length
train.ts <- window(retail.ts, start = c(1996, 1), end = c(1996, nTrain)) # Training set
valid.ts <- window(retail.ts, start = c(1996, nTrain + 1), end = c(1996, nTrain + stepsAhead)) # Validation set
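Note that the partition is chronological: the validation window is always the final stepsAhead observations, never a random sample, because the goal is to mimic forecasting genuinely unseen future values. A minimal sketch of the same idea (written in Python purely for illustration; the window() calls above do this in R):

```python
def chronological_split(series, steps_ahead):
    """Hold out the last steps_ahead observations for validation."""
    n_train = len(series) - steps_ahead
    return series[:n_train], series[n_train:]

# Stand-in for 24 monthly observations
series = list(range(24))
train, valid = chronological_split(series, steps_ahead=12)
print(len(train), len(valid))  # 12 12
```

Every validation observation comes strictly after every training observation, which is what makes the accuracy figures later in this lab honest estimates of forecasting performance.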

78 Fitting a linear model to the training set with trend and seasonal components

79 A linear model is a good starting point for time series with trend and seasonality

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(train.lm.trend.season)

79.1

79.2 Call:

79.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

79.4

79.5 Residuals:

79.6 Min 1Q Median 3Q Max

79.7 -23.9822 -1.0645 0.0657 1.3221 10.6067

79.8

79.9 Coefficients:

79.10 Estimate Std. Error t value Pr(>|t|)

79.11 (Intercept) 3.962e+01 7.251e-01 54.649 < 2e-16 ***

79.12 trend 1.250e-01 7.034e-03 17.768 < 2e-16 ***

79.13 I(trend^2) 2.078e-04 2.109e-05 9.853 < 2e-16 ***

79.14 season2 1.139e+00 7.956e-01 1.432 0.153

79.15 season3 3.248e+00 7.956e-01 4.083 5.67e-05 ***

79.16 season4 4.450e+00 7.956e-01 5.593 4.93e-08 ***

79.17 season5 5.321e+00 7.956e-01 6.688 1.06e-10 ***

79.18 season6 5.866e+00 7.956e-01 7.373 1.55e-12 ***

79.19 season7 6.718e+00 7.956e-01 8.444 1.23e-15 ***

79.20 season8 4.521e+00 7.957e-01 5.682 3.08e-08 ***

79.21 season9 4.476e+00 7.957e-01 5.625 4.16e-08 ***

79.22 season10 7.593e+00 7.957e-01 9.543 < 2e-16 ***

79.23 season11 1.431e+01 8.033e-01 17.808 < 2e-16 ***

79.24 season12 2.508e+01 8.033e-01 31.223 < 2e-16 ***

79.25

79.26 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

79.27

79.28 Residual standard error: 2.923 on 308 degrees of freedom

79.29 Multiple R-squared: 0.978, Adjusted R-squared: 0.9771

79.30 F-statistic: 1054 on 13 and 308 DF, p-value: < 2.2e-16

80 Forecasting future values using the fitted linear model

81 This step helps in predicting future values based on historical data

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead)

82 Plotting the forecast and comparing it with actual values

83 Visual comparison helps in assessing the model’s performance

plot(train.lm.t.s.pred, ylim = c(40, 130), ylab = "Retail Sales Index - J5AH", xlab = "Year", main = "Forecast vs Actual: Retail Sales Index", lty = 2)
axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))
lines(train.lm.t.s.pred$fitted, lwd = 1, col = "blue") # Adding fitted values to the plot
lines(valid.ts) # Adding validation set for comparison

84 Evaluating the accuracy of the forecast against the validation set

85 Accuracy metrics provide quantitative measures of the forecast’s performance

accuracy(train.lm.t.s.pred$mean, as.numeric(valid.ts))

85.1 ME RMSE MAE MPE MAPE

85.2 Test set 5.237359 5.945129 5.237359 4.419619 4.419619
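The columns reported by accuracy() are simple summaries of the forecast errors e = actual - predicted. The sketch below (in Python, purely for illustration; the formulas are the standard ones the forecast package uses) computes ME, RMSE, MAE and MAPE by hand on a toy series:

```python
import math

def forecast_metrics(actual, predicted):
    """ME, RMSE, MAE and MAPE (in %) from paired actual/predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n                                   # mean error (keeps the sign: bias)
    rmse = math.sqrt(sum(e * e for e in errors) / n)       # root mean squared error
    mae = sum(abs(e) for e in errors) / n                  # mean absolute error
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return me, rmse, mae, mape

# Toy example: forecasts that are uniformly 5 units too low
actual = [100.0, 110.0, 120.0]
predicted = [95.0, 105.0, 115.0]
print(forecast_metrics(actual, predicted))  # ME = MAE = 5, RMSE = 5
```

Because ME keeps the sign of the errors, ME equal to MAE (as in the retail output above, 5.24 for both) tells you every validation error had the same sign: the model systematically under-forecast.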

86 Analyzing autocorrelation in the series and in the residuals of the linear model

87 Autocorrelation checks help in identifying any lingering patterns in the data

Acf(retail.ts, lag.max = 12, main = "Retail Sales - Autocorrelations at Lags 1-12")

Acf(train.lm.trend.season$residuals, lag.max = 12, main = "Retail Sales - Errors & Autocorrelation")

88 Fitting an AR model to the residuals as a second layer of analysis

89 This step is to model any autocorrelation not captured by the linear model

train.res.arima <- Arima(train.lm.trend.season$residuals, order = c(1,0,0))
train.res.arima.pred <- forecast(train.res.arima, h = stepsAhead)

90 Plotting and analyzing the residuals and autocorrelation of the ARIMA model

91 Helps in understanding the model’s effectiveness in handling autocorrelation

plot(train.lm.trend.season$residuals, ylim = c(-7, 7), main = "Residuals after ARIMA Model", xlab = "Year", ylab = "Residuals")
axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))
lines(train.res.arima.pred$fitted, lwd = 2, col = "blue")

Acf(train.res.arima$residuals, lag.max = 12, main = "Autocorrelation in Residuals of the ARIMA Model")

92 Using auto.arima to automatically find the best ARIMA model specification

93 This function helps in simplifying the model selection process

train.arima <- auto.arima(train.ts)
summary(train.arima)

93.1 Series: train.ts

93.2 ARIMA(2,1,2)(0,1,2)[12]

93.3

93.4 Coefficients:

93.5 ar1 ar2 ma1 ma2 sma1 sma2

93.6 -0.1844 0.4917 -0.0388 -0.8752 -0.8100 0.1608

93.7 s.e. 0.0919 0.0915 0.0657 0.0702 0.0793 0.0696

93.8

93.9 sigma^2 = 3.865: log likelihood = -650.06

93.10 AIC=1314.12 AICc=1314.5 BIC=1340.26

93.11

93.12 Training set error measures:

93.13 ME RMSE MAE MPE MAPE MASE

93.14 Training set 0.1157932 1.907105 1.117698 0.03713381 1.414958 0.3776832

93.15 ACF1

93.16 Training set 0.02814524

94 Forecasting and plotting using the auto.arima model

95 Provides a comparative view of the forecast vs actual values

train.arima.pred <- forecast(train.arima, h = stepsAhead)
plot(train.arima.pred, ylim = c(40, 130), main = "Forecast vs Actual: Retail Sales Index", xlab = "Year", ylab = "Retail Sales Index - J5AH")
axis(1, at = seq(1996, 2024, 2), labels = format(seq(1996, 2024, 2)))
lines(train.arima.pred$fitted, lwd = 1, col = "blue")
lines(valid.ts)

96 Evaluating accuracy and analyzing autocorrelation in the residuals of the auto.arima model

97 Final step to assess the overall performance of the chosen ARIMA model

accuracy(train.arima.pred$mean, as.numeric(valid.ts))

97.1 ME RMSE MAE MPE MAPE

97.2 Test set 2.704173 3.481628 2.704173 2.37663 2.37663

Acf(train.arima$residuals, lag.max = 12, main = "Retail Sales - Errors & Autocorrelation")

Lab 14: Regression and external information

Example used in this lab

Cam Calderon

2023-12-10

External Information - GDP~Investment+Consumption

  1. Package Installation and Library Loading

98 Install ‘forecast’ package for time series analysis and forecasting techniques

#install.packages("forecast")
# Load the 'forecast' library to access its functions
library(forecast)

98.1 Registered S3 method overwritten by ‘quantmod’:

98.2 method from

98.3 as.zoo.data.frame zoo

99 Install ‘lubridate’ package for easier date-time manipulation

#install.packages("lubridate")
# Load the 'lubridate' library for working with dates
library(lubridate)

99.1

99.2 Attaching package: ‘lubridate’

99.3 The following objects are masked from ‘package:base’:

99.4

99.5 date, intersect, setdiff, union

100 Load the ‘readr’ library for data import and processing

library(readr)

  2. Reading and Preparing the Data

101 Read the GDP data from a CSV file

gdp.df <- read_csv("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/gdp external info 3.csv")

101.1 Rows: 87 Columns: 4

101.2 ── Column specification ────────────────────────────────────────────────────────

101.3 Delimiter: ","

101.4 chr (1): date

101.5 dbl (3): gdp, invest, consu

101.6

101.7 ℹ Use spec() to retrieve the full column specification for this data.

101.8 ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

102 Convert the ‘date’ column to Date format for time series analysis

gdp.df$Date <- as.Date(gdp.df$date, format = "%d/%m/%Y")

103 Combine ‘invest’ and ‘consu’ columns as a data frame for modeling

x <- as.data.frame(cbind(gdp.df$invest, gdp.df$consu))

104 Assign GDP values to ‘y’

y <- gdp.df$gdp

  3. Setting Up Training and Validation Sets

105 Determine the number of total and validation observations

nTotal <- length(y)
nValid <- 20
nTrain <- nTotal - nValid

106 Split the data into training and validation sets

xTrain <- x[1:nTrain, ]
yTrain <- y[1:nTrain]
xValid <- x[(nTrain + 1):nTotal, ]
yValid <- y[(nTrain + 1):nTotal]

  4. Time Series Model and Forecasting

107 Convert the training set ‘yTrain’ into a time series object

yTrain.ts <- ts(yTrain)

109 Fit a time series linear model (tslm) to the training data

# 'formula' is not defined elsewhere in this handout, so it is reconstructed here
# from the model summary below: GDP on a trend plus investment (V1) and consumption (V2)
formula <- yTrain.ts ~ trend + V1 + V2
gdp.tslm <- tslm(formula, data = xTrain, lambda = 1)

110 Forecast GDP using the fitted model and validation predictors

gdp.tslm.pred <- forecast(gdp.tslm, newdata = xValid)

111 Plot the forecast with actual data for comparison

plot(gdp.tslm.pred, ylim = c(300000, 590000), xlab = "quarters", ylab = "gdp")
lines(y)

112 Set options for numeric display

options(scipen = 999, digits=6)

113 Display a summary of the fitted time series linear model

summary(gdp.tslm)

113.1

113.2 Call:

113.3 tslm(formula = formula, data = xTrain, lambda = 1)

113.4

113.5 Residuals:

113.6 Min 1Q Median 3Q Max

113.7 -11899 -2788 686 2740 8312

113.8

113.9 Coefficients:

113.10 Estimate Std. Error t value Pr(>|t|)

113.11 (Intercept) 35376.01 17571.13 2.01 0.048 *

113.12 trend -2949.99 213.33 -13.83 < 0.0000000000000002 ***

113.13 V1 -8.37 1.01 -8.31 0.00000000001 ***

113.14 V2 2438.62 111.14 21.94 < 0.0000000000000002 ***

113.15

113.16 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

113.17

113.18 Residual standard error: 4350 on 63 degrees of freedom

113.19 Multiple R-squared: 0.988, Adjusted R-squared: 0.987

113.20 F-statistic: 1.68e+03 on 3 and 63 DF, p-value: <0.0000000000000002

114 Evaluate the accuracy of the forecast

accuracy(yValid, gdp.tslm.pred$mean)

114.1 ME RMSE MAE MPE MAPE ACF1 Theil’s U

114.2 Test set 7372.67 13573.3 10878.3 1.38981 2.1273 0.847148 2.55948

Lab 15: Binary forecast and logistic regression

Example – Binary Regression

Cam Calderon

2023-12-12

115 Install and load necessary packages

#install.packages("caret") # Uncomment this line to install the caret package if not already installed
library(caret) # Load the caret package for advanced data analysis

115.1 Warning: package ‘caret’ was built under R version 4.3.2

115.2 Loading required package: ggplot2

115.3 Loading required package: lattice

116 Read and preprocess the data

rain.df <- read.csv("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/MelbourneRainfall.csv") # Read the rainfall data from a CSV file
rain.df$Date <- as.Date(rain.df$Date, format = "%d/%m/%Y") # Convert the Date column to Date format
rain.df$Rainy <- ifelse(rain.df$Rainfall > 0, 1, 0) # Create a binary Rainy variable: 1 if Rainfall > 0, otherwise 0

117 Create lagged and seasonal features

nPeriods <- length(rain.df$Rainy) # Calculate the number of time periods in the dataset
rain.df$Lag1 <- c(NA, rain.df$Rainfall[1:(nPeriods-1)]) # Create a Lag1 feature representing the previous day's rainfall
rain.df$t <- seq(1, nPeriods, 1) # Create a time index t
rain.df$Seasonal_sine <- sin(2 * pi * rain.df$t / 365.25) # Create a seasonal sine feature
rain.df$Seasonal_cosine <- cos(2 * pi * rain.df$t / 365.25) # Create a seasonal cosine feature

118 Split the data into training and validation sets

train.df <- rain.df[rain.df$Date <= as.Date("31/12/2009", format = "%d/%m/%Y"), ] # Define the training set as data up to the end of 2009
train.df <- train.df[-1, ] # Remove the first row to avoid NA in Lag1
valid.df <- rain.df[rain.df$Date > as.Date("31/12/2009", format = "%d/%m/%Y"), ] # Define the validation set as data from 2010 onwards
xvalid <- valid.df[, c(4, 6, 7)] # Select relevant columns for the validation set

119 Build and summarize the logistic regression model

rainy.lr <- glm(Rainy ~ Lag1 + Seasonal_sine + Seasonal_cosine, data = train.df, family = "binomial") # Fit a logistic regression model
summary(rainy.lr) # Display a summary of the model

119.1

119.2 Call:

119.3 glm(formula = Rainy ~ Lag1 + Seasonal_sine + Seasonal_cosine,

119.4 family = "binomial", data = train.df)

119.5

119.6 Coefficients:

119.7 Estimate Std. Error z value Pr(>|z|)

119.8 (Intercept) -0.76888 0.03858 -19.927 < 2e-16 ***

119.9 Lag1 0.11187 0.01137 9.843 < 2e-16 ***

119.10 Seasonal_sine -0.26885 0.05049 -5.324 1.01e-07 ***

119.11 Seasonal_cosine -0.37134 0.05067 -7.328 2.33e-13 ***

119.12

119.13 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

119.14

119.15 (Dispersion parameter for binomial family taken to be 1)

119.16

119.17 Null deviance: 4751.8 on 3651 degrees of freedom

119.18 Residual deviance: 4533.7 on 3648 degrees of freedom

119.19 AIC: 4541.7

119.20

119.21 Number of Fisher Scoring iterations: 4
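The fitted coefficients are on the log-odds scale. To turn them into a rain probability for a given day, plug the predictors into the linear part and apply the logistic (inverse-logit) transform; in R, predict(rainy.lr, ..., type = "response") does this for you. The sketch below (in Python, purely for illustration) uses the coefficient values from the summary above; the example day is hypothetical.

```python
import math

def rain_probability(lag1, sine, cosine):
    """Probability of rain from the logistic model's log-odds (coefficients from the summary above)."""
    # log-odds = intercept + b1*Lag1 + b2*Seasonal_sine + b3*Seasonal_cosine
    z = -0.76888 + 0.11187 * lag1 - 0.26885 * sine - 0.37134 * cosine
    return 1 / (1 + math.exp(-z))  # logistic (inverse-logit) transform

# Hypothetical day following 10mm of rain, mid-year (sine near 0, cosine near -1)
p = rain_probability(lag1=10.0, sine=0.0, cosine=-1.0)
print(round(p, 3))
```

The positive Lag1 coefficient means yesterday's rainfall raises today's rain probability, which is exactly the persistence the confusion matrices below exploit.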

120 Make predictions and evaluate the model

rainy.lr.pred <- predict(rainy.lr, xvalid, type = "response") # Make predictions on the validation set
confusionMatrix(table(ifelse(rainy.lr$fitted > 0.5, 1, 0), train.df$Rainy)) # Confusion matrix for the training set

120.1 Confusion Matrix and Statistics

120.2

120.3

120.4 0 1

120.5 0 2251 1115

120.6 1 104 182

120.7

120.8 Accuracy : 0.6662

120.9 95% CI : (0.6507, 0.6815)

120.10 No Information Rate : 0.6449

120.11 P-Value [Acc > NIR] : 0.003566

120.12

120.13 Kappa : 0.1166

120.14

120.15 Mcnemar’s Test P-Value : < 2.2e-16

120.16

120.17 Sensitivity : 0.9558

120.18 Specificity : 0.1403

120.19 Pos Pred Value : 0.6687

120.20 Neg Pred Value : 0.6364

120.21 Prevalence : 0.6449

120.22 Detection Rate : 0.6164

120.23 Detection Prevalence : 0.9217

120.24 Balanced Accuracy : 0.5481

120.25

120.26 ‘Positive’ Class : 0

120.27

confusionMatrix(table(ifelse(rainy.lr.pred > 0.5, 1, 0), valid.df$Rainy)) # Confusion matrix for the validation set

120.28 Confusion Matrix and Statistics

120.29

120.30

120.31 0 1

120.32 0 373 220

120.33 1 21 55

120.34

120.35 Accuracy : 0.6398

120.36 95% CI : (0.6021, 0.6762)

120.37 No Information Rate : 0.5889

120.38 P-Value [Acc > NIR] : 0.004043

120.39

120.40 Kappa : 0.1647

120.41

120.42 Mcnemar’s Test P-Value : < 2.2e-16

120.43

120.44 Sensitivity : 0.9467

120.45 Specificity : 0.2000

120.46 Pos Pred Value : 0.6290

120.47 Neg Pred Value : 0.7237

120.48 Prevalence : 0.5889

120.49 Detection Rate : 0.5575

120.50 Detection Prevalence : 0.8864

120.51 Balanced Accuracy : 0.5734

120.52

120.53 ‘Positive’ Class : 0

120.54

Understanding the Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model. It summarizes the number of correct and incorrect predictions, comparing the actual values with the model’s predictions. It’s particularly useful for understanding the performance of a model in binary classification tasks.

The key terms in a confusion matrix are:

True Positives (TP): Correctly predicted positive observations.

True Negatives (TN): Correctly predicted negative observations.

False Positives (FP): Incorrectly predicted positive observations (Type I error).

False Negatives (FN): Incorrectly predicted negative observations (Type II error).

Key Points to Discuss

Accuracy: This is the proportion of total correct predictions (both true positives and true negatives) out of all predictions. It’s a good measure when the classes are balanced but can be misleading when dealing with imbalanced classes.

Sensitivity (Recall): This indicates how well the model predicts the positive class. High sensitivity means the model catches most positives, but this can come at the cost of more false positives.

Specificity: This is about identifying negatives. A high specificity means the model is good at avoiding false positives but might miss some positives (false negatives).

Precision (Positive Predictive Value): This tells you how many of the positive predictions were actually positive.

Negative Predictive Value: This is about the proportion of negatives that were correctly identified.

Balance Between Sensitivity and Specificity: Ideally, a model should have a good balance between sensitivity and specificity. However, depending on the application, you might prefer one over the other (e.g., in medical diagnostics, high sensitivity might be preferred).

Differences Between Training and Validation Results: It’s essential to compare the performance on the training set and the validation set to check for overfitting. If a model performs significantly better on the training set than on the validation set, it might be overfitted.
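These definitions can be checked against the validation-set output above. The sketch below (in Python, purely for illustration) reproduces the reported accuracy, sensitivity, specificity and precision from the four cell counts, remembering that caret treats class 0 (a dry day) as the 'Positive' class here:

```python
# Validation-set confusion matrix from the output above (positive class = 0, i.e. dry).
tp, fn = 373, 21    # actual dry days:   correctly / incorrectly predicted
fp, tn = 220, 55    # actual rainy days: incorrectly / correctly predicted

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall on the positive (dry) class
specificity = tn / (tn + fp)   # recall on the negative (rainy) class
precision = tp / (tp + fp)     # positive predictive value

print(accuracy, sensitivity, specificity, precision)
# compare with the caret output above: 0.6398, 0.9467, 0.2000, 0.6290
```

Working through the arithmetic makes the imbalance warning concrete: accuracy looks respectable only because dry days dominate, while the specificity of 0.20 shows the model catches just one rainy day in five.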

Lab 16: Neural Networks I

Cam Calderon

2023-12-12

NEURAL NETS

121 Load the ‘readxl’ library for reading Excel files

library(readxl)

122 Read the GBP/USD exchange rate data from an Excel file

GBPUSD <- read_excel("C:/Users/Cam/OneDrive - De Montfort University/a- BECS2002 2023-24 - PUBLIC/002 Datasets/GBBUSD.xlsx")

123 Convert the data to a time series object starting from Dec 2003 to Dec 2023 with monthly frequency

GBPUSD.ts <- ts(GBPUSD$gbpusd, start = c(2003,12), end = c(2023, 12), freq = 12)

124 Load the ‘forecast’ library for time series forecasting

library(forecast)

124.1 Registered S3 method overwritten by ‘quantmod’:

124.2 method from

124.3 as.zoo.data.frame zoo

125 Plot the time series data

plot(GBPUSD.ts, xlab = "Year", ylab = "Exchange rate", main = "GBP/USD")

126 Set the number of steps to forecast ahead

stepsAhead <- 12

127 Calculate the length of the training dataset

nTrain <- length(GBPUSD.ts) - stepsAhead

128 Create a training time series dataset

train.ts <- window(GBPUSD.ts, start = c(2003, 12), end = c(2003,nTrain+11))

129 Create a validation time series dataset

valid.ts <- window(GBPUSD.ts, start = c(2003, nTrain + 12), end = c(2003, nTrain + stepsAhead+11))

130 Build a linear model with trend and seasonality for the training dataset

train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)

131 Display summary statistics of the model

summary(train.lm.trend.season)

131.1

131.2 Call:

131.3 tslm(formula = train.ts ~ trend + I(trend^2) + season)

131.4

131.5 Residuals:

131.6 Min 1Q Median 3Q Max

131.7 -0.28243 -0.07537 -0.01590 0.06872 0.30900

131.8

131.9 Coefficients:

131.10 Estimate Std. Error t value Pr(>|t|)

131.11 (Intercept) 1.937e+00 3.433e-02 56.428 < 2e-16 ***

131.12 trend -3.660e-03 4.667e-04 -7.842 2.04e-13 ***

131.13 I(trend^2) 2.217e-06 1.965e-06 1.128 0.261

131.14 season2 -1.862e-03 3.769e-02 -0.049 0.961

131.15 season3 -7.891e-03 3.769e-02 -0.209 0.834

131.16 season4 1.260e-02 3.769e-02 0.334 0.739

131.17 season5 5.390e-03 3.769e-02 0.143 0.886

131.18 season6 5.085e-03 3.769e-02 0.135 0.893

131.19 season7 1.468e-02 3.770e-02 0.389 0.697

131.20 season8 1.776e-03 3.770e-02 0.047 0.962

131.21 season9 -3.495e-03 3.770e-02 -0.093 0.926

131.22 season10 2.450e-03 3.770e-02 0.065 0.948

131.23 season11 3.764e-03 3.771e-02 0.100 0.921

131.24 season12 -6.615e-03 3.723e-02 -0.178 0.859

131.25

131.26 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

131.27

131.28 Residual standard error: 0.1162 on 215 degrees of freedom

131.29 Multiple R-squared: 0.7743, Adjusted R-squared: 0.7606

131.30 F-statistic: 56.73 on 13 and 215 DF, p-value: < 2.2e-16

132 Build a simpler linear model with only trend for the training dataset

train.lm.trend.season <- tslm(train.ts ~ trend)

133 Display summary statistics of the simpler model

summary(train.lm.trend.season)

133.1

133.2 Call:

133.3 tslm(formula = train.ts ~ trend)

133.4

133.5 Residuals:

133.6 Min 1Q Median 3Q Max

133.7 -0.29007 -0.07364 -0.01599 0.05980 0.30994

133.8

133.9 Coefficients:

133.10 Estimate Std. Error t value Pr(>|t|)

133.11 (Intercept) 1.9193772 0.0150599 127.45 <2e-16 ***

133.12 trend -0.0031494 0.0001135 -27.74 <2e-16 ***

133.13

133.14 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

133.15

133.16 Residual standard error: 0.1136 on 227 degrees of freedom

133.17 Multiple R-squared: 0.7722, Adjusted R-squared: 0.7712

133.18 F-statistic: 769.5 on 1 and 227 DF, p-value: < 2.2e-16

134 Predict future values using the simpler model

train.lm.t.s.pred <- forecast(train.lm.trend.season, h = stepsAhead, level = 0)

135 Plot the forecasted values

plot(train.lm.t.s.pred, ylim = c(1, 2.2), ylab = "GBP/USD", xlab = "Year", bty = "l", xaxt = "n", xlim = c(2003, 2025), main = "Quarterly value change", flty = 2)

136 Add custom year axis

axis(1, at = seq(2003, 2025, 1), labels = format(seq(2003, 2025, 1)))

137 Add lines to the plot for the fitted and actual values

lines(train.lm.t.s.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)

138 Forecast the next 12 time points and calculate accuracy

fcast <- forecast(train.lm.trend.season, h = 12)
accuracy(fcast$mean, valid.ts)

138.1 ME RMSE MAE MPE MAPE ACF1 Theil’s U

138.2 Test set 0.0675716 0.07356447 0.0675716 5.38607 5.38607 0.4909296 2.734794

139 Set a random seed for reproducibility

set.seed(201)

140 Build a neural network model for time series forecasting

GBPUSD.nnetar <- nnetar(train.ts, repeats = 20, p = 11, P = 1, size = 7)

141 Display summary of the neural network model

summary(GBPUSD.nnetar$model[[1]])

141.1 a 12-7-1 network with 99 weights

141.2 options were - linear output units

141.3 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 i9->h1

141.4 -0.20 -0.21 -1.60 0.43 0.53 0.31 -0.81 -0.86 -1.79 -0.34

141.5 i10->h1 i11->h1 i12->h1

141.6 2.56 1.38 0.82

141.7 b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 i9->h2

141.8 -2.55 -2.86 -3.25 0.16 -4.35 5.22 3.62 1.75 0.42 -4.16

141.9 i10->h2 i11->h2 i12->h2

141.10 0.31 -1.71 1.05

141.11 b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 i8->h3 i9->h3

141.12 0.39 -0.33 0.72 -0.95 0.67 -1.03 -0.41 0.02 0.38 -0.35

141.13 i10->h3 i11->h3 i12->h3

141.14 0.48 -0.58 -0.71

141.15 b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4 i8->h4 i9->h4

141.16 0.61 0.24 2.34 -0.75 -3.61 2.74 2.16 -3.91 -1.23 0.60

141.17 i10->h4 i11->h4 i12->h4

141.18 0.65 1.43 2.42

141.19 b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5 i8->h5 i9->h5

141.20 -0.51 -1.65 -1.27 0.91 -1.55 1.61 0.79 2.00 0.63 -0.91

141.21 i10->h5 i11->h5 i12->h5

141.22 -1.46 0.40 -0.58

141.23 b->h6 i1->h6 i2->h6 i3->h6 i4->h6 i5->h6 i6->h6 i7->h6 i8->h6 i9->h6

141.24 0.28 -0.55 4.71 -1.40 -0.90 -0.95 2.54 -0.73 4.05 0.77

141.25 i10->h6 i11->h6 i12->h6

141.26 -5.06 -3.66 -0.12

141.27 b->h7 i1->h7 i2->h7 i3->h7 i4->h7 i5->h7 i6->h7 i7->h7 i8->h7 i9->h7

141.28 3.25 -3.75 4.04 -0.45 -4.41 -1.99 2.81 -5.32 7.86 -6.64

141.29 i10->h7 i11->h7 i12->h7

141.30 2.48 -4.27 -0.53

141.31 b->o h1->o h2->o h3->o h4->o h5->o h6->o h7->o

141.32 4.37 -2.82 0.67 -3.29 -0.38 -2.34 -1.36 0.72
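The "12-7-1 network with 99 weights" line can be verified by counting parameters: p = 11 non-seasonal lags plus P = 1 seasonal lag give 12 inputs, size = 7 hidden units feed one output, and each hidden and output unit carries a bias term (the b-> rows above). A quick arithmetic check (in Python, purely for illustration):

```python
def nnet_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights in a single-hidden-layer network with bias terms (the architecture nnetar fits)."""
    hidden = (n_inputs + 1) * n_hidden   # input->hidden weights plus one bias per hidden unit
    output = (n_hidden + 1) * n_outputs  # hidden->output weights plus the output bias
    return hidden + output

print(nnet_weight_count(12, 7))  # 99, matching the summary above
```

Counting weights this way is a useful sanity check before fitting: 99 free parameters on roughly 229 monthly observations explains why nnetar averages over repeats = 20 restarts to stabilise the fit.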

142 Predict future values using the neural network model

GBPUSD.nnetar.pred <- forecast(GBPUSD.nnetar, h = stepsAhead)

143 Calculate accuracy of the neural network predictions

accuracy(GBPUSD.nnetar.pred, valid.ts)

143.1 ME RMSE MAE MPE MAPE

143.2 Training set 0.0001190531 0.02059234 0.0153757 -0.01928677 1.026308

143.3 Test set -0.1579077603 0.19391569 0.1580158 -12.68557381 12.694344

143.4 MASE ACF1 Theil’s U

143.5 Training set 0.1371076 0.01308142 NA

143.6 Test set 1.4090524 0.76056006 7.320825

144 Plot the training time series with predictions

plot(train.ts, ylim = c(1, 2.2), ylab = "GBP/USD", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2003, 2025), lty = 1)
axis(1, at = seq(2003, 2025, 1), labels = format(seq(2003, 2025, 1)))
lines(GBPUSD.nnetar.pred$fitted, lwd = 2, col = "blue")
lines(GBPUSD.nnetar.pred$mean, lwd = 2, col = "blue", lty = 2)
lines(valid.ts)

145 Forecast the next 12 time points using the neural network model

fcast <- forecast(GBPUSD.nnetar, h=12)

146 Calculate accuracy of the neural network model’s forecast

accuracy(fcast$mean, valid.ts)

146.1 ME RMSE MAE MPE MAPE ACF1 Theil’s U

146.2 Test set -0.1579078 0.1939157 0.1580158 -12.68557 12.69434 0.7605601 7.320825

147 Reinitialize the random seed

set.seed(201)

148 Rebuild the neural network model using the entire dataset

GBPUSD.nnetar <- nnetar(GBPUSD.ts, repeats = 20, p = 11, P = 1, size = 7)

149 Display summary of the neural network model

summary(GBPUSD.nnetar$model[[1]])

149.1 a 12-7-1 network with 99 weights

149.2 options were - linear output units

149.3 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 i9->h1

149.4 -2.20 -0.55 -2.93 2.61 -0.83 2.81 0.94 0.96 -4.78 1.08

149.5 i10->h1 i11->h1 i12->h1

149.6 -0.49 0.81 0.74

149.7 b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 i9->h2

149.8 0.34 -1.04 2.02 -1.51 -2.22 -3.26 -1.41 1.46 2.97 -1.69

149.9 i10->h2 i11->h2 i12->h2

149.10 -4.14 -3.26 2.62

149.11 b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 i8->h3 i9->h3

149.12 0.23 -0.93 1.62 -2.05 2.24 -2.08 -0.01 -0.17 -0.58 0.70

149.13 i10->h3 i11->h3 i12->h3

149.14 2.29 -1.60 -0.03

149.15 b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4 i8->h4 i9->h4

149.16 -1.46 -1.41 0.88 -1.37 3.24 0.77 2.16 -0.62 -1.63 -0.22

149.17 i10->h4 i11->h4 i12->h4

149.18 3.79 -3.03 6.84

149.19 b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5 i8->h5 i9->h5

149.20 -0.04 -0.39 -1.46 2.05 -2.50 1.99 -0.12 0.03 1.47 -1.07

149.21 i10->h5 i11->h5 i12->h5

149.22 -2.61 1.94 -0.02

149.23 b->h6 i1->h6 i2->h6 i3->h6 i4->h6 i5->h6 i6->h6 i7->h6 i8->h6 i9->h6

149.24 2.15 0.48 1.11 -1.17 -2.56 1.90 -5.83 3.64 2.07 -4.07

149.25 i10->h6 i11->h6 i12->h6

149.26 0.23 2.87 -3.71

149.27 b->h7 i1->h7 i2->h7 i3->h7 i4->h7 i5->h7 i6->h7 i7->h7 i8->h7 i9->h7

149.28 1.44 -2.31 2.05 -0.34 -4.69 -0.22 -5.12 2.20 5.54 -6.61

149.29 i10->h7 i11->h7 i12->h7

149.30 -1.72 -3.78 5.35

149.31 b->o h1->o h2->o h3->o h4->o h5->o h6->o h7->o

149.32 4.21 -1.12 -1.14 -3.19 -0.64 -2.77 -1.06 1.07

150 Forecast the next 60 time points using the neural network model

GBPUSD.nnetar.pred <- forecast(GBPUSD.nnetar, h = 60)

151 Plot the entire time series with the neural network model’s predictions

plot(GBPUSD.ts, ylim = c(1, 2.2), ylab = "GBP/USD", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2003, 2030), lty = 1)
axis(1, at = seq(2003, 2030, 1), labels = format(seq(2003, 2030, 1)))
lines(GBPUSD.nnetar.pred$fitted, lwd = 2, col = "blue")
lines(GBPUSD.nnetar.pred$mean, lwd = 2, col = "blue", lty = 2)

152 Forecast the next 60 time points using the neural network model and calculate accuracy

fcast <- forecast(GBPUSD.nnetar, h = 60)
accuracy(GBPUSD.nnetar.pred$fitted, GBPUSD.ts)

152.1 ME RMSE MAE MPE MAPE ACF1

152.2 Test set -7.349151e-06 0.0210837 0.01593545 -0.02976382 1.082309 -0.006698643

152.3 Theil’s U

152.4 Test set 0.560275

Lab 17: Neural Networks II

(there is an R script in the notes on the slides)

This is the last session about analytics and Q&A session about your projects.

PRACTICE LABS

Lab 13 – models with trend and seasonality

As in the previous labs, this week we will cover a practical example linked to the most recent lecture. On my YouTube channel, I produced an example using European GDP, which you can review for future reference.

ACTIVITY ONE – Retail Sales

Task 1. The retail sector is without doubt one of the sectors most affected during the pandemic; your first task is to read the latest report from the ONS.

Task 2. Go to the folder 'R Code' on Blackboard, then download and open the file 'Models with trend and seasonality'. I wrote this R code for the Excel file 'ONS retail.xls', which is available in the folder 'Data sets' on Blackboard; you should also download the Excel file and place it in any folder, such as 'Documents'. Note that we created a default working directory for econ2545 in a previous lab.

Run the entire code and check it line by line to understand what you are doing. If you have questions about any line in the code, ask your lecturer during the session.

Task 3. The Excel file 'Retail sales' is outdated, and it would be fantastic to analyse up-to-date retail sales, as this sector is very important for the UK economy. You must update the Excel file using the ONS website or Refinitiv. Section 7, on the ONS link above, contains data for non-seasonally-adjusted retail sales, which may be preferable since we would like to forecast with seasonality.

Task 4. Now that you know how to adapt the R code to new data, it's time to use the dataset you will use for your first assessment (presentation, 40%). Adapt the code and ask your lecturer questions if needed.

Lab 15- Foreign Exchange Rate Market

Activity 1. FX Top of Book

Task 1. In the Eikon Toolbar, search .

Task 2. Press the settings icon to add new profiles (groups of currencies) or to add different currencies to a group.

Activity 2. Currency Performance/ Value Tracker

Using the Currency Performance/Value Tracker application, users can compare how a specific currency has performed historically and how it is performing currently. While the 3-month implied volatility reflects the market's current sentiment on the currency pair, the 3-month realised volatility is the volatility actually traded over the past 3 months. Hence, comparing the two figures can help an options trader decide whether to buy or sell an option right now.

Task 1. In the Eikon Toolbar, search

For example, here the 3M implied volatility for USDCAD is 6.75, while the 3M realised volatility is 6.57. Hence the currency pair USDCAD is currently expected to be more volatile than it was over the past 3 months.

Have a look at the current volatilities: what is the most volatile currency pair, using both implied and realised volatility?
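As a reminder of what 'realised volatility' measures, the sketch below (in Python, purely for illustration; the prices are made up) computes an annualised realised volatility from a short daily price series. This backward-looking figure is what the forward-looking implied quote is compared against.

```python
import math

def realised_vol(prices, trading_days=252):
    """Annualised realised volatility (in %) from a daily price series."""
    rets = [math.log(b / a) for a, b in zip(prices, prices[1:])]  # daily log returns
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)    # sample variance
    return 100 * math.sqrt(var * trading_days)                    # annualise and express in %

# Hypothetical USDCAD closes over five consecutive days
prices = [1.3500, 1.3525, 1.3490, 1.3510, 1.3535]
print(round(realised_vol(prices), 2))
```

In practice a 3M realised figure uses around 63 trading days rather than four returns; the tiny sample here is only to keep the arithmetic visible.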

ACTIVITY 3. FX Polls / FX forecast

Task 1. Type FX Polls or FX Forecast in the application search box located at the top left of your screen.

You can also access FX Polls by clicking the Refinitiv icon next to the search box, then Markets > FX Overview Guide, and clicking the FX Polls link available in the MARKET TOOLS area.

Task 2. Choose one of the forecast contributors and check the box ‘Show Historical Poll Data’. In your opinion, what is the best forecaster for EURUSD?

Activity 4. Calculating GBPUSD autocorrelation

Task 1. Get the time series in R.

152.4.0.1 API

#install.packages("devtools")

library(devtools)

#install_github("ahmedmohamedali/eikonapir", force = TRUE)

library(eikonapir)

eikonapir::set_proxy_port(9000L)

eikonapir::set_app_id('your Refinitiv Key - APPKEY')

gbp <- get_timeseries("GBP=",

start_date = "2013-01-01T01:00:00",

end_date = paste0(Sys.Date(), "T01:00:00"),

raw_output = FALSE, interval = "monthly")

str(gbp)

gbp.n <- as.numeric(gbp$CLOSE)

gbp.df <- as.data.frame(gbp.n)

View(gbp.df)

Task 2. Declare the dataset as a time series and run an ACF plot to check the autocorrelation.

#install.packages("forecast")

library(forecast)

#install.packages("zoo")

library(zoo)

# set as a time series and plot the data

gbp.ts <- ts(gbp.df, start = c(2013, 1), end = c(2020, 12), freq = 12)

plot(gbp.ts)

# Autocorrelation ACF

Acf(gbp.ts, lag.max = 24, main = "ACF GBPUSD")

Task 3. Generate partitions of the data

stepsAhead <- 12

nTrain <- length(gbp.ts) - stepsAhead

train.ts <- window(gbp.ts, start = c(2013, 1), end = c(2013,nTrain))

valid.ts <- window(gbp.ts, start = c(2013, nTrain + 1), end = c(2013, nTrain + stepsAhead))

Task 4. Estimate the training period using linear regression

# summary of output from fitting additive seasonality
# note: quadratic trend and seasonality removed as not significant

train.lm.trend.season <- tslm(train.ts ~ trend)

summary(train.lm.trend.season)

Task 5. Is there autocorrelation in the errors?

# is there autocorrelation?

Acf(train.lm.trend.season$residuals, lag.max = 24, main = "Linear regression errors - ACF GBPUSD")

Task 6. Fit an AR(1) model with the errors to improve.

train.res.arima <- Arima(train.lm.trend.season$residuals, order = c(1,0,0))

summary(train.res.arima)

Acf(train.res.arima$residuals, lag.max = 24, main = "ACF GBPUSD")

This is better, but we could carry on and fit an AR(2) to improve further.
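The logic behind Tasks 5 and 6 (check the residual ACF, then model the errors with an AR term) can be sketched outside R as well. This Python snippet uses a simulated AR(1) error series, not the GBPUSD data; it estimates the AR coefficient from the lag-1 autocorrelation and verifies that the residuals of the fitted model are roughly uncorrelated:

```python
import random

# Simulate an AR(1) error series e_t = 0.7 * e_{t-1} + noise
# (synthetic data for illustration, not the lab's regression residuals)
random.seed(1)
phi = 0.7
e = [0.0]
for _ in range(500):
    e.append(phi * e[-1] + random.gauss(0, 1))

def lag1_corr(x):
    """Lag-1 sample autocorrelation."""
    m = sum(x) / len(x)
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# The lag-1 autocorrelation is a simple estimate of the AR(1) coefficient
phi_hat = lag1_corr(e)

# Residuals after removing the fitted AR(1) part: autocorrelation near zero
resid = [e[t] - phi_hat * e[t - 1] for t in range(1, len(e))]
```

This is the same idea `Arima(..., order = c(1,0,0))` applies (by maximum likelihood) to the regression errors in the lab.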

Lab 16 – Money Market and Bond Yield Polls; why are inverted yield curves important?

Activity 1. Find in Refinitiv news the concepts of inverted yield curves and 'Operation Twist'.

Activity 2. Using linear regression, forecast the short-run yields and compare your forecast with the best contributors' forecasts.

Task 1. Use the command to open a window with Money Market and Bond Yield polls. Choose 'United States' as the country, and select 'Bond Yields' as the poll type.

Task 2. Select two specific contributors; a line representing their forecast can be added by clicking the box next to their name. Try to choose two good contributors by visualising their past forecasting performance.

Task 3. If you have technical problems with Mac and Refinitiv, you can download the file 'yields.xlsx' from the Blackboard folder 'datasets'. Click on the Excel icon and download 5 years of data for your selected contributors. You must clean the file as in the picture below.

Task 4. Find a better forecast than the ones produced by the contributors. Based on your forecast, recommend whether 'Operation Twist' will be needed in the US.

#install.packages("forecast")

library(forecast)

#install.packages("lubridate")

library(lubridate)

library(readxl)

library(zoo)

yields <- read_excel("~/econ2545/datasets/yields.xlsx", skip = 6)

#View(yields)

yields$date <- as.Date(yields$date, format = "%d/%m/%Y")

x <- as.data.frame(cbind(c(NA, head(yields$y2actual, -1)), yields$y2bbva))

x

#View(x)

y <- yields$y2actual

#View(y)

nTotal <- length(y)-4

nValid <- 4

nTrain <- nTotal - nValid

xTrain <- x[1:nTrain,]

yTrain <- y[1:nTrain]

#View(xTrain)

xValid <- x[(nTrain + 1):nTotal,]

yValid <- y[(nTrain + 1):nTotal]

yTrain.ts <- ts(yTrain)

(formula <- as.formula(paste("yTrain.ts", paste(c("", colnames(xTrain)), collapse = "+"), sep = "~")))

yields.tslm <- tslm(formula, data = xTrain)

yields.tslm.pred <- forecast(yields.tslm, newdata = xValid)

plot(y, ylim = c(-1, 6), xlab = "Quarters", ylab = "Quarterly Yields", xaxt = "n")

axis(1, at = seq(1, 20, 4), labels = format(seq(2017, 2021, 1)))

options(scipen = 999, digits=6)

summary(yields.tslm)

#lines(y)

lines(yields$y2aeconomics, col = "blue")

lines(yields$y2amundi, col = "orange")

lines(yields$y2bbva, col = "red")

lines(yields.tslm.pred$mean, col = "black", type = "l", lwd = 5)

lines(yields.tslm.pred$fitted)

xValid2 <- x[(nTrain + 1):(nTotal+4),]

yields.tslm2 <- tslm(formula, data = xTrain, lambda = 1)

yields.tslm.pred2 <- forecast(yields.tslm2, newdata = xValid2)

lines(yields.tslm.pred2$mean)

xValid2

Lab 18 – Neural Networks

Forecasting Australian Wine Sales: The figure shows time plots of monthly sales of six types of Australian wines (red, rose, sweet white, dry white, sparkling, and fortified) for 1980-1994. Data available in AustralianWines.xls. The units are thousands of liters.

You are hired to obtain short-term forecasts (2-3 months ahead) for each of the six series, and this task will be repeated every month.

Would you consider neural networks for this task? Explain why.

No: these series are not high frequency; they exhibit seasonality and/or trend, which should be possible to model with linear regression and/or smoothing.
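To see why a simple trend-plus-seasonality model is a reasonable benchmark here, the following Python sketch (synthetic monthly data, not the wine series) recovers a linear trend and monthly seasonal indices and reconstructs the series almost exactly:

```python
# Synthetic monthly series with a linear trend and additive seasonality
# (illustrative numbers only)
n = 120  # 10 years of monthly data
season = [10, 5, 0, -5, -10, -5, 0, 5, 10, 15, 10, 5]
y = [100 + 0.5 * t + season[t % 12] for t in range(n)]

# Estimate the trend slope from first-year vs last-year averages
# (seasonal effects cancel because each average covers whole years)
first = sum(y[:12]) / 12
last = sum(y[-12:]) / 12
slope = (last - first) / (n - 12)

# Detrend, then estimate each month's seasonal index as a mean deviation
detrended = [y[t] - slope * t for t in range(n)]
level = sum(detrended) / n
idx = [sum(detrended[m::12]) / len(detrended[m::12]) - level for m in range(12)]

# Reconstruct the series and check the fit
fitted = [level + slope * t + idx[t % 12] for t in range(n)]
max_err = max(abs(a - b) for a, b in zip(y, fitted))
```

A regression with trend and seasonal dummies (as `tslm(y ~ trend + season)` does in R) captures this structure directly, without the opacity of a neural network.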

Use neural networks to forecast fortified wine sales, as follows:

Partition the data using the period until December 1993 as the training period.

Run a neural network using R's nnetar with 11 non-seasonal lags (i.e., p = 11). Leave all other arguments at their defaults.

dat <- read.csv("AustralianWines.csv")

require(forecast)
require(zoo)
require(caret)

# Just fortified wine
dat <- dat[, 1:2]

# Create ts object
dat.ts <- ts(dat$Fortified, start = c(1980, 1), end = c(1994, 12), freq = 12)

# Slice the sets
train.ts <- window(dat.ts, start = c(1980, 1), end = c(1993, 12))
valid.ts <- window(dat.ts, start = c(1994, 1), end = c(1994, 12))

# Train the neural net
train.nnetar <- nnetar(train.ts, p = 11)
summary(train.nnetar$model[[1]])
train.ts.pred <- forecast(train.nnetar, h = 12)
accuracy(train.ts.pred, valid.ts)

Create a time plot for the actual and forecasted series over the training period. Create also a time plot of the forecast errors for the training period. Interpret what you see in the plots.

# plot results
plot(train.ts, ylim = c(-200, 5500), ylab = "Fortified", xlab = "Time", bty = "l", xlim = c(1980, 1994), lty = 1)
lines(train.ts.pred$fitted, lwd = 1, col = "blue")
lines(train.ts.pred$residuals, col = "red")
lines(train.ts.pred$mean, lwd = 1, col = "blue", lty = 2)
lines(valid.ts)

The fit over the training period is almost perfect, which indicates overfitting.

Use the neural net to forecast sales in January and February 1994.

Jan 1994: 1,426.92

Feb 1994: 1,254.17

Compare your neural net to an exponential smoothing model used to forecast fortified wine sales.

Use R’s ets function to automatically select and fit an exponential smoothing model to the training period until December 1993. Which model did ets fit?

The fitted model is ETS(M,A,M): multiplicative error, additive trend, multiplicative seasonality, with alpha = 0.0555 and beta and gamma close to zero:

train.ets <- ets(train.ts)
summary(train.ets)

ETS(M,A,M)

Call: ets(y = train.ts)

Smoothing parameters:
  alpha = 0.0555
  beta  = 9e-04
  gamma = 1e-04

Initial states:
  l = 4040.0811
  b = -6.7983
  s = 1.1316 1.0399 0.8877 0.9505 1.2722 1.3862 1.1463 1.1097 0.9345 0.8513 0.6996 0.5903

sigma: 0.0859

AIC      AICc     BIC
2755.038 2759.118 2808.145

Training set error measures:
             ME        RMSE     MAE      MPE       MAPE     MASE      ACF1
Training set -25.32466 287.8687 224.6507 -1.317643 7.229271 0.8073515 0.05168201

Use this exponential smoothing model to forecast sales in January and February 1994.

train.ets.pred <- forecast(train.ets, h = 2)

Jan 1994: 1289.829164

Feb 1994: 1521.475001

How does this setting compare to the lagged-series setting in terms of predictive performance?

accuracy(train.ts.pred, valid.ts)

accuracy(train.ets.pred, valid.ts)

Accuracy of NN Model:

ME RMSE MAE MPE MAPE MASE ACF1 Theil's U

Training set -0.18756 77.39955 59.99268 -0.22975 2.174161 0.215602 -0.07679 NA

Test set 200.3241 348.627 312.5818 7.559032 14.33551 1.123359 -0.25296 0.812298

Accuracy of ETS Model:

ME RMSE MAE MPE MAPE MASE ACF1 Theil's U

Training set -25.3247 287.8687 224.6507 -1.31764 7.229271 0.807351 0.051682 NA

Test set 125.5691 328.9246 256.394 4.443793 10.85886 0.921431 -0.01106 0.714046
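The accuracy measures reported above follow standard definitions. As a reminder of how they are computed, here is a minimal Python sketch (toy numbers, not the wine data) of ME, RMSE, MAE and MAPE:

```python
import math

# Toy validation set: actual values and model forecasts (illustrative only)
actual   = [1200.0, 1400.0, 1600.0, 1500.0]
forecast = [1250.0, 1350.0, 1700.0, 1450.0]

# Forecast errors: actual minus forecast
errors = [a - f for a, f in zip(actual, forecast)]

me   = sum(errors) / len(errors)                              # mean error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))   # root mean squared error
mae  = sum(abs(e) for e in errors) / len(errors)              # mean absolute error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(actual)  # in percent
```

A lower RMSE/MAE/MAPE on the test (validation) set, as the ETS model achieves here, indicates better out-of-sample predictive performance than the neural network despite the latter's near-perfect training fit.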

Supplementary Lab 1

Refinitiv training

Click on the question mark icon.

Then click on training.

Click on ‘Live Training’.

Schedule one of the courses at Refinitiv.

Supplementary Lab – Social, Macro Overview and Signal

Activity 1 – Social Media Monitor

The Social Media Monitor uses text mining to view trends in the markets from a social perspective over the last few days. Eikon users can choose different instruments by searching in the in-app search bar and selecting a sentiment line or bar chart to view a particular company's positivity or negativity on social media. This gives you an idea of the market sentiment regarding a specific company and can help predict possible stock price movements.

  1. In the Eikon Toolbar, search to open the social monitor application

  2. Select the different instruments and click on the symbol to display it on the sentiment line or bar chart.

  3. The leftmost column would display the latest tweets and posts about the company of interest.

Activity 2 – Macroeconomic Overview

The Economic Overview page allows you to download historical data for each country.

Look at the ‘Economic Overview’ of Mexico.

Review GDP per capita at constant prices.

Select the maximum number of years for the historical data.

Review and follow the training video about 'Macroeconomic Overview'. Training videos can be found under the '?' icon at the top of your Refinitiv workspace.

Supplementary Activity 3 -

Signal monitors a list of securities in real time against one or more technical criteria. It alerts you immediately in the signal panel when a security meets one or more of the conditions. In the Eikon toolbar, search to open the application.

  1. Click the icon next to the “Add symbol or portfolio” option to customise which Portfolios, Chain RICs or Individual RICs you are interested in.

  2. Click the icon next to the “Add signal” option to customise the type of signal you want to monitor.

Supplementary Activity

Questions about students’ projects.

Supplementary Lab – Alternative investing and CBA

ACTIVITY 1 – Alternative investing

There are alternative investments to stocks, bonds and forex (the foreign exchange market). One example is yieldstreet.com, a company that provides access to alternative investments previously reserved for institutions and the ultra-wealthy. This company makes financial products more inclusive by creating a modern investment portfolio. We will review their site for educational purposes only, as some of their projects and investments are good real-life examples of cost-benefit analysis.

Go to

Click on the label ‘product’ and then ‘Real estate investing’.

Click on ‘view open offerings’ and then

Click on 'Dallas-Fort Worth Multi-Family Equity I', read the proposal and identify the costs and benefits from the investor's perspective.

Create or use a Gmail account and click on 'log in'.

After logging in, read the investment memorandum for more details about the investment.

ACTIVITY 2 – Cost Benefit questions

  1. Requiring bicycle riders to wear helmets, Boardman et al.(2014), CH1, exercise 1, p23

  2. Building a public swimming pool, Boardman et al.(2014), CH1, exercise 3, p24

  3. VHS vs Betamax, Boardman et al. (2014), CH1, exercise 1, p48

  1. Requiring bicycle riders to wear helmets, Boardman et al. (2014), CH1, exercise 1, p23

Imagine that you live in a city that currently does not require bicycle riders to wear helmets. Furthermore, imagine that you enjoy riding your bicycle without wearing a helmet.

  1. From your perspective, what are the major costs and benefits of a proposed city ordinance that would require all bicycle riders to wear helmets?

  2. What are the categories of costs and benefits from society’s perspective?

1.a. The most significant categories of costs to you as an individual are probably: the purchase price of a helmet, the reduced pleasure of riding your bicycle while wearing a helmet, diminished appearance when you take the helmet off (bad hair), and the inconvenience of keeping the helmet available. The most significant categories of benefits are probably: reduced risk of serious head injury (morbidity) and reduced risk of death (mortality).

1.b. There are a number of categories of costs and benefits that do not affect you (directly or are insignificant), but which are important in aggregate. These are:

program enforcement (a cost)

reduced health care costs (a benefit), although this may not be as high as one might expect if bicyclists ride more aggressively because they feel safer (this is called off-setting behaviour)

increased pollution, due to cyclists switching to cars (a cost)

A social cost-benefit analysis would take account of these costs and benefits in addition to your costs.

  2. Building a public swimming pool, Boardman et al. (2014), CH1, exercise 3, p24

(Instructor-provided spreadsheet recommended) Your county is considering building a public swimming pool. Analysts have estimated the present values of the following effects over the expected useful life of the pool:

PV (million dollars)

State grant: 2.2
Construction and maintenance costs: 12.5
Personnel costs: 8.2
Revenue from county residents: 8.6
Revenue from non-residents: 2.2
Use value benefit to county residents: 16.6
Use value benefit to non-residents: 3.1
Scrap value: 0.8

The state grant is only available for this purpose. Also, the construction and maintenance will have to be done by an out-of-county firm.

  1. Assuming national-level standing, what are the social net benefits of the project?

  2. Assuming county-level standing, what are the social net benefits of the project?

  3. How would a guardian in the county budget office calculate net benefits?

  4. How would a spender in the county recreation department calculate net benefits?

  1. The spreadsheet available from the instructor web page facilitates the following estimates of net benefits (millions of dollars):

We recommend that instructors delete the cell entries under these columns and distribute the spreadsheet to students. As this is a very simple use of a spreadsheet, it makes a good introduction for students who have not used them before.

As an alternative, instructors can distribute the spreadsheet as provided and give the students a different set of costs and benefits.

  3. VHS vs Betamax, Boardman et al. (2014), CH1, exercise 1, p48

Many experts claim that, although VHS came to dominate the video recorder market, Betamax was a superior technology. Assume that these experts are correct, so that, all other things equal, a world in which all video recorders were Betamax technology would be Pareto superior to a world in which all video recorders were VHS technology. Yet it seems implausible that a policy that forced a switch in technologies would be even potentially Pareto improving. Explain.

  1. Obviously, the switch itself from VHS to Betamax would be costly: the stocks of existing VHS tapes and equipment would lose their value, and the equipment for producing them would have to be retired earlier than would otherwise be the case. As the replacement would almost certainly occur gradually, there would be a transition period during which positive "network" externalities, the benefits from having compatible systems, would be reduced.

More generally, it is important to keep in mind the distinction between Pareto efficient outcomes and Pareto efficient moves. If everyone were at least as well off, and some were better off, in some alternative to the status quo, then the alternative would be considered Pareto superior. Yet, if the move to the alternative were sufficiently costly, then it would not be Pareto improving. Only if the move were costless, the common assumption in the comparison of alternative equilibria in economic theory, would the Pareto efficiency of outcomes correspond to the Pareto efficiency of moves. In the real world, moves are rarely costless, so policy alternatives are best thought of as moves rather than as outcomes.

Supplementary Lab - Willingness to pay

  1. Let's explore the concept of willingness to pay with a thought experiment. Imagine a specific sporting, entertainment, or cultural event that you would very much like to attend: perhaps a World Cup match, the seventh game of the World Series, a Garth Brooks concert, or a Kathleen Battle performance.
  1. What is the most you would be willing to pay for a ticket to the event?

  2. Imagine that you won a ticket to the event in a lottery. What is the minimum amount of money that you would be willing to accept to give up the ticket?

  3. Imagine that you had an income 50 percent higher than it is now, but that you didn’t win a ticket to the event. What is the most you would be willing to pay for a ticket?

  4. Do you know anyone who would sufficiently dislike the event that they would not use a free ticket unless they were paid to do so?

  5. Do your answers suggest any possible generalizations about willingness to pay?

2.a. Students' answers will vary (they should be greater than or equal to 0).

2.b. Most people would be willing to pay less to obtain something than the amount of compensation they would require to give the same thing up willingly if they already owned it. This difference has been frequently observed and economists refer to it as “the difference between willingness to pay and willingness to accept.” Though some of the difference may be attributable to the lower wealth level of the individual in the first case than in the second case, it almost certainly also reflects the way people perceive gains and losses.

2.c. Willingness to pay depends on people’s wealth. If a person’s income rises, then the person is wealthier and is likely to be willing to pay more for goods such as tickets to recreational events. (Recreational events are normal goods.)

2.d. Different people can have very different willingness-to-pay amounts for the same good. Indeed, it is quite likely that some people would have a negative willingness to pay for a recreational event that others would be willing to pay large positive amounts to attend – tastes differ. In CBA, it is important to keep in mind that a project effect may simultaneously be viewed by some as a benefit and by others as a cost.

  1. How closely do government expenditures measure opportunity cost for each of the following program inputs?
  1. Time of jurors in a criminal justice program that requires more trials.

  2. Land to be used for a nuclear waste storage facility, which is owned by the government and located on a military base.

  3. Labor for a reforestation program in a small rural community with high unemployment.

  4. Labor of current government employees who are required to administer a new program.

  5. Concrete that was previously poured as part of a bridge foundation.

3.a. Most jurisdictions pay jurors a small per diem and reimburse them for commuting and meal expenses. For most jurors, these payments fall short of the opportunity costs of their time. For employed workers, a more reasonable estimate of the opportunity cost of their time would be their wage rates. Note that, from the social perspective, it makes no difference whether or not workers continue to receive their wages while on jury duty. Society is forgoing their labor, which the market values at their wage rates. For those not employed, the opportunity cost is the value they place on their forgone leisure.

3.b. Assume that the government does not charge itself for the use of land that it owns. As long as the land could be used for something other than a nuclear waste facility, the government’s accounting would underestimate the opportunity cost of the land. If the land could be sold to private developers, for example, then its market price would be a better reflection of its opportunity cost. If the fact that the land is on a military base precludes its sale to private developers, then the opportunity cost of the land would depend on the other uses to which it could be put by the government.

3.c. Government expenditures on wages would overestimate the opportunity cost if the workers would have otherwise been unemployed. The opportunity cost of the workers is the value they place on the leisure time that they are giving up.

3.d. As the employees are already on the government payroll, the diversion of their time to the program would not involve additional expenditures. The opportunity cost of their time depends on how they would have been using it in the absence of the program. If the government efficiently used labor, then the opportunity cost of their time would be measured by their wage rates. If the government inefficiently used labor, so that the value of output given up per hour diverted is less than their wage rate, then the opportunity cost would be less than the wage rate.

3.e. Once it is in place, the concrete has zero opportunity cost if it cannot be salvaged and reused, regardless of whether or not the government has yet paid the bill for it. This is the classic case of a "sunk cost." Indeed, imagine that the bridge project were to be cancelled. Then, for safety reasons, the concrete would have to be removed, requiring the use of labor and equipment. Consequently, with respect to the bridge project, the opportunity cost of the concrete is negative: not having to remove it is a benefit of continuing the project!

  1. Three mutually exclusive projects are being considered for a remote river valley: Project R, a recreational facility, has estimated benefits of $10 million and costs of $8 million; project F, a forest preserve with some recreational facilities, has estimated benefits of $13 million and costs of $10 million; project W, a wilderness area with restricted public access, has estimated benefits of $5 million and costs of $1 million. In addition, a road could be built for a cost of $4 million that would increase the benefits of project R by $8 million, increase the benefits of project F by $5 million, and reduce the benefits of project W by $1 million. Even in the absence of any of the other projects, the road has estimated benefits of $2 million.

Calculate the benefit-cost ratio and net benefits for each possible alternative to the status quo. Note that there are seven possible alternatives to the status quo: R, F, and W, both with and without the road, and the road alone.

If only one of the seven alternatives can be selected, which should be selected according to the CBA decision rule?

4.a. The seven possible alternatives to the status quo have the following costs (millions), benefits (millions), benefit/cost ratios, and net benefits (millions):

Alternative              B ($M)  C ($M)  B/C ratio  NB ($M)
Project R without road     10       8       1.25       2
Project R with road        18      12       1.50       6
Project F without road     13      10       1.30       3
Project F with road        18      14       1.29       4
Project W without road      5       1       5.00       4
Project W with road         4       5       0.80      -1
Road alone                  2       4       0.50      -2

4.b. Even though Project W without the road has the largest benefit/cost ratio, Project R with the road offers the largest net benefits among the possible projects and therefore would be selected by the CBA decision rule.
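The ratios and net benefits in the table are simple arithmetic and can be checked programmatically. A Python sketch (the dictionary keys are shorthand for the table rows):

```python
# Benefits and costs in $ millions: base figures plus road effects
# (road costs 4; it adds 8 to R's benefits, 5 to F's, subtracts 1 from W's)
projects = {
    "R without road": (10, 8),
    "R with road":    (10 + 8, 8 + 4),
    "F without road": (13, 10),
    "F with road":    (13 + 5, 10 + 4),
    "W without road": (5, 1),
    "W with road":    (5 - 1, 1 + 4),
    "Road alone":     (2, 4),
}

for name, (b, c) in projects.items():
    print(f"{name:15s} B/C = {b / c:.2f}  NB = {b - c}")

# CBA decision rule: choose the alternative with the largest net benefits,
# not the largest benefit-cost ratio
best = max(projects, key=lambda k: projects[k][0] - projects[k][1])
```

Running this confirms that Project W without the road has the highest ratio (5.00) but Project R with the road has the largest net benefits ($6 million).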

Supplementary Lab – R Shiny

What is a Shiny App? R Shiny is a web application framework for R that allows you to create interactive web apps for data visualization and analysis without needing to know HTML, CSS, or JavaScript. It enables users to turn their R code into interactive dashboards, reports, or tools.

Let's review an example. The idea is to look at what the package Shiny can do rather than the details of the coding. Go to the public directory for BECS 2002 and open the folder 'R Shiny Tutorial'. In the subfolder '01-01' you will find the R project '01-01', which is an example of what R Shiny does. Double-click to open the R project '01-01' in RStudio.

Double-click on server.R, ui.R and data-processing.R. Check the created tabs and install packages if necessary; in the example below, I installed the package shinycustomloader.

Go to the tab server.R and click on ‘Run App’.

Select any continent, country and indicator, and click on 'Update chart'. This is an interactive web application, and it has been built entirely with R.

Concepts:

Shiny is a self-contained web framework for building interactive web apps with R code.

An individual Shiny app is an interactive web app viewed in a web browser.

Shiny runs the R code on the ‘server’ not in the web browser.

You can deploy apps to the fully hosted shinyapps.io service or self-host using Shiny Server.